Abstract
Motif discovery is gaining increasing attention in the domain of functional data analysis. Functional motifs are typical “shapes” or “patterns” that recur multiple times in different portions of a single curve and/or in misaligned portions of multiple curves. In this paper, we define functional motifs using an additive model and we propose funBIalign for their discovery and evaluation. Inspired by clustering and biclustering techniques, funBIalign is a multi-step procedure which uses agglomerative hierarchical clustering with complete linkage and a functional distance based on mean squared residue scores to discover functional motifs, both in a single curve (e.g., time series) and in a set of curves. We assess its performance and compare it to other recent methods through extensive simulations. Moreover, we use funBIalign for discovering motifs in two real-data case studies; one on food price inflation and one on temperature changes.
Supplementary Information
The online version contains supplementary material available at 10.1007/s11222-024-10537-y.
Keywords: Functional data analysis, Functional motif discovery, Clustering, Biclustering
Background and motivation
The last decades have seen an increasing interest in the analysis of functional data, i.e., data that can be represented as smooth curves. Functional Data Analysis (FDA) methods (see, e.g., Ramsay and Silverman 2005; Ferraty and Vieu 2006; Kokoszka and Reimherr 2017) have been applied in a variety of scientific fields. These include biology, medicine, and genetics, where FDA has been employed to analyze, e.g., the genomic landscape of “jumping genes” and COVID-19 epidemics (Chen et al. 2020; Boschi et al. 2021); neurosciences and psychometrics, where it has been used, e.g., to map cognitive processes analyzing response times and brain imaging data (Buckner et al. 2004; Lila et al. 2017); economics and environmental sciences, where it has been used to explore patterns and generate predictions over space or time concerning air pollution, climate indicators, stock market prices, etc. (see, e.g., Das et al. 2019; Ghumman et al. 2020).
In this paper, we focus on functional motif discovery; that is, the identification of typical “shapes” or “patterns” that recur multiple times in different portions of a single curve and/or in misaligned portions of multiple curves. While the ability to identify such patterns offers great promise in multiple scientific fields (Cremona and Chiaromonte 2023), to the best of our knowledge, the notion of functional motif still lacks a rigorous statistical formalization. We define a functional motif Q of length l as a collection of curve portions $q_j(t)$ for $t \in [0, l]$ with $j = 1, \dots, n$ (the occurrences of Q) obeying the additive model
$$q_j(t) = \mu + \alpha_j + \beta(t) + \varepsilon_j(t),$$
where $\mu$ is the mean of the motif, $\alpha_j$ its portion-specific adjustment, $\beta(t)$ its t-varying adjustment and $\varepsilon_j(t)$ an error term (see Fig. 1).
Fig. 1.

Examples of ideal functional motifs, obtained by (1) with error term $\varepsilon_j(t) = 0$. In panel A all portions are identical and constant (portion-specific adjustment $\alpha_j = 0$ and t-varying adjustment $\beta(t) = 0$). In panel B all portions are identical ($\alpha_j = 0$). In panel C portions are constant and parallel ($\beta(t) = 0$). Panel D illustrates the general case, with parallel portions sharing the same shape. In all panels, the dashed line represents the motif mean
In order to discover functional motifs, either in a single curve or in a set of curves (evaluated over a grid of equally spaced points), we develop funBIalign, a multi-step algorithm that requires as input, in addition to the curve(s) themselves, only the discretized length and the minimum cardinality of the motifs to be discovered. funBIalign performs a comprehensive scan of all portions of the given length of all curves in the data, and arranges them in a dendrogram employing agglomerative hierarchical clustering with complete linkage and a functional generalization of the Mean Squared Residue Score (MSR), a measure typically employed by biclustering techniques (Pontes et al. 2015). The dendrogram is dynamically cut to identify a set of candidate functional motifs, which are then post-processed to select the most interesting. While hierarchical agglomerations are commonly used for functional clustering (see Ferreira and Hitchcock 2009 for a comparison, and Jacques and Preda 2014 for a survey on functional clustering techniques), the use of mean squared residues still represents a novelty in the functional framework. The MSR was originally introduced by Cheng and Church (2000) for the discovery and validation of biclusters, i.e. of subsets of rows and columns of a data matrix, and has been widely used since in the multivariate setting (e.g., Yang et al. 2005; Liu and Wang 2007; Angiulli et al. 2008), but its functional generalization, fMSR, has only very recently been employed in functional clustering problems by Galvani et al. (2021) and Di Iorio and Vantini (2023).
Functional motif discovery, while gaining increasing attention, is itself an under-explored area. To the best of our knowledge, the only other approach that deals specifically with it in the FDA domain is probabilistic K-means with local alignment (probKMA, Cremona and Chiaromonte 2023). Following a stream of literature devoted to the simultaneous alignment and clustering of curves at the global level (see, e.g., Liu and Yang 2009; Sangalli et al. 2010), and to the simultaneous domain selection and clustering of curves (Fraiman et al. 2016; Floriello and Vitelli 2017; Vitelli 2023), probKMA identifies candidate functional motifs combining a probabilistic K-means algorithm and local alignment (or domain selection) techniques. A strong point of probKMA is the fact that, while requiring the specification of a minimum motif length, it can extend such length in a motif-specific and data-driven fashion. However, it is designed to operate on a set of curves and it requires extensive post-processing. As a counterpart to its efficacy, the need for multiple initializations of the K-means algorithm and the complexity of post-processing make probKMA computationally expensive. Outside the FDA domain, a problem similar to functional motif discovery has been tackled by the data mining community, which seeks patterns embedded multiple times within a single time series (see, e.g., Lonardi and Patel 2002; Mueen et al. 2009; Yeh et al. 2016). These works are based on a k-Nearest Neighbors algorithm and require users to specify several parameters, some of which may not be intuitive and can significantly affect outcomes (newer versions of these procedures have been published since 2016, but they primarily enhance computational performance on large data sets, rather than modifying required inputs or algorithmic approach).
As mentioned above, and in contrast to existing methods, funBIalign can naturally handle both single curves and sets of curves. In the latter case, thanks to the hierarchy it creates, it can also highlight relationships among curves harboring the same motif. On a different front, although the comprehensive scan employed by funBIalign can be computationally demanding, it must be performed only once – without the need for multiple runs or initializations as in probKMA. More generally, funBIalign can effectively tackle applications where functional alignment fails or may be inadequate, such as identifying motifs embedded consecutively, or appearing only in one curve. We also note that the two input parameters required by our procedure, the motif length and the minimum number of portions , while impacting outcomes, are user-friendly and intuitive.
The remainder of the paper is organized as follows. Section 2 presents the theoretical setting of funBIalign, including the rigorous definition of functional motifs and of the dissimilarity measure employed. The algorithmic implementation is described in Section 3. Finally, the performance of our proposal is assessed through simulations, comparisons with other available methods, and real-data case studies in Sections 4, 5 and 6.
Theoretical setting
Model-based definition of functional motifs
Consider a set of real-valued curves , , each defined on a compact interval which we assume to be without loss of generality. Intuitively, a functional motif is a “shape” or “pattern”, defined on a domain interval of given length l, which occurs multiple times within the set of curves – possibly with noise. A motif can occur at different positions within a curve and/or in misaligned portions of different curves in the set ( corresponds to the case in which motifs are sought within a single curve, e.g., a time series). For each curve , we consider all possible overlapping portions of length l, and we align them on the interval [0, l]. In symbols, a generic portion is with , where L is a sub-interval of length l of and h is a shift transformation from [0, l] to L. Let I be the collection of all portions of length l of all curves in the set, i.e. . We provide a rigorous definition of functional motif using an additive model as follows.
Definition 1
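A functional motif Q is a collection of curve portions $q_j(t)$, for $t \in [0, l]$ and $j = 1, \dots, n$, such that
$$q_j(t) = \mu + \alpha_j + \beta(t) + \varepsilon_j(t), \tag{1}$$
where $\mu$ is a mean level, $\alpha_j$ a portion-specific adjustment, $\beta(t)$ a t-varying adjustment, and $\varepsilon_j(t)$ an error term.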
To have unique and identifiable parameters, we impose , , and for and . It is important to remark that, even if the definition above does not consider explicitly the curve to which each portion belongs to, this information is crucial in the algorithm we introduce in Sect. 3.
Definition 1 is inspired by the biclustering literature, and in particular by the definition of multivariate coherent evolution biclusters (Madeira and Oliveira 2004) utilized in the seminal paper by Cheng and Church (2000). In the functional framework, Galvani et al. (2021) used a model similar to the one in (1) to discover functional biclusters in a data matrix whose cells correspond to curves, while Di Iorio and Vantini (2023) employed the model in (1) to identify local clusters in subsets of aligned curves. Following the biclustering literature, we call a functional motif “ideal” when the error term $\varepsilon_j(t)$ is identically 0, i.e. in the absence of noise. Such a motif is composed of perfectly parallel portions sharing the same shape (Fig. 1D). Among ideal motifs, special cases can be obtained by setting the portion-specific adjustment $\alpha_j$ and/or the t-varying adjustment $\beta(t)$ to 0. If both $\alpha_j = 0$ and $\beta(t) = 0$ we have a constant motif, whose portions are all equal to the mean level $\mu$ (Fig. 1A). Setting $\beta(t) = 0$ but allowing $\alpha_j \neq 0$ we have parallel constant portions (Fig. 1C), while setting $\alpha_j = 0$ but allowing $\beta(t) \neq 0$ all portions will be identical to $\mu + \beta(t)$ (Fig. 1B).
Evaluating the coherence of functional motifs
To evaluate the coherence of a candidate motif, i.e. of a collection of portions, to an (unknown) additive model, we gauge the error term using a functional version of the Mean Squared Residue score (MSR), or H-score, first introduced by Cheng and Church (2000) to seek large biclusters among the rows (genes) and columns (experimental conditions) of a gene expression data matrix. Galvani et al. (2021) first proposed a generalization of the MSR to the functional framework to seek large biclusters in matrices of curves. Here, we define the functional Mean Squared Residue score (fMSR) as follows.
Definition 2
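The functional Mean Squared Residue score (fMSR) of a functional motif Q comprising n portions $q_j(t)$, $t \in [0, l]$, is
$$\mathrm{fMSR}(Q) = \frac{1}{n} \sum_{j=1}^{n} \frac{1}{l} \int_0^l \left( q_j(t) - \hat{\mu} - \hat{\alpha}_j - \hat{\beta}(t) \right)^2 dt, \tag{2}$$
where the estimates $\hat{\mu}$, $\hat{\alpha}_j$ and $\hat{\beta}(t)$ of the terms in (1) are
$$\hat{\mu} = \bar{q}, \qquad \hat{\alpha}_j = \bar{q}_j - \bar{q}, \qquad \hat{\beta}(t) = \bar{q}(t) - \bar{q}.$$
In forming the estimates, $\bar{q}_j$, $\bar{q}(t)$ and $\bar{q}$ represent, respectively, the mean value of each portion $q_j(t)$, the functional mean of all portions $q_j(t)$, $j = 1, \dots, n$, and the mean value of $\bar{q}(t)$ (or equivalently, the mean of the $\bar{q}_j$’s). Using them, we can rewrite (2) as
$$\mathrm{fMSR}(Q) = \frac{1}{n} \sum_{j=1}^{n} \frac{1}{l} \int_0^l \left( q_j(t) - \bar{q}_j - \bar{q}(t) + \bar{q} \right)^2 dt, \tag{3}$$
which allows us to easily implement the score calculation. We observe that the fMSR of an ideal, i.e. noiseless, functional motif Q is $\mathrm{fMSR}(Q) = 0$. Thus, similar to biclustering methods which look for biclusters with low MSR, we look for functional motifs with low fMSR.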
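As an illustration of how the score can be computed on discretized portions, the following is a minimal sketch, not the authors' implementation. It assumes that all portions are evaluated on the same equally spaced grid and that integrals are approximated by plain grid averages; the function name `fmsr` and the example data are hypothetical.

```python
def fmsr(portions):
    """Discretized fMSR of a candidate motif, in the centered form of Eq. (3).

    `portions` is a list of n equal-length lists: the portions evaluated on a
    common, equally spaced grid. Integrals are replaced by grid averages
    (an assumption of this sketch).
    """
    n = len(portions)
    m = len(portions[0])
    if any(len(p) != m for p in portions):
        raise ValueError("all portions must have the same number of grid points")
    # Mean value of each portion.
    portion_means = [sum(p) / m for p in portions]
    # Functional mean of all portions, one value per grid point.
    functional_mean = [sum(p[k] for p in portions) / n for k in range(m)]
    # Mean of the functional mean (equivalently, mean of the portion means).
    grand_mean = sum(portion_means) / n
    total = 0.0
    for j, p in enumerate(portions):
        for k in range(m):
            residual = p[k] - portion_means[j] - functional_mean[k] + grand_mean
            total += residual ** 2
    return total / (n * m)


# An ideal motif: one shape, vertically shifted (parallel portions).
shape = [0.0, 1.0, 4.0, 9.0]
ideal = [[v + shift for v in shape] for shift in (-2.0, 0.0, 3.0)]
# Two portions with opposite trends, which no additive model can explain.
crossing = [[0.0, 1.0], [1.0, 0.0]]
```

With these inputs, `fmsr(ideal)` is zero up to floating-point error, since vertical shifts are absorbed by the portion-specific adjustment, while `fmsr(crossing)` is strictly positive.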
Di Iorio et al. (2020) recently proved that the MSR is biased towards small biclusters (i.e., biclusters with a small number of rows and/or columns). Here, we prove that the fMSR suffers from a similar bias towards small functional motifs; that is, motifs comprising a small number of portions. This can hinder motif discovery, as it distorts the comparisons of motifs comprising different numbers of portions. The following theorem fully characterizes this bias and suggests a way of correcting it (a proof is provided in Section S1 and the bias in the multivariate framework is extensively treated and illustrated through simulations in Di Iorio et al. 2020).
Theorem 1
Let Q be a functional motif. For let be the average fMSR of all sub-motifs of Q obtained selecting exactly n of the portions belonging to Q. Then
| 4 |
An implication of Theorem 1 is that the ratio depends on the number of portions n included in the sub-motif. It is also straightforward to verify that
| 5 |
for every , allowing one to compute using alone. Hence, we have that
| 6 |
The infinite product in (6) converges and we have . As a consequence, the bias can be at most 1, and it decreases when considering motifs comprising a large number of portions (see Di Iorio et al. 2020, for simulations and additional details in the multivariate case). However, its effects can be troublesome in comparisons involving motifs with few occurrences, which will be non-negligibly favored. For this reason, we propose to correct the bias defining an adjusted version of the fMSR as follows.
Definition 3
Let Q be a functional motif comprising portions. Its adjusted functional mean squared residue score is defined as:
| 7 |
We note that the correction is straightforward and it requires only a multiplication of the fMSR score by a factor - with negligible additional computational burden. The adjusted fMSR is the measure we employ in the remainder of the paper (a numerical comparison of the proposed algorithm with adjusted and non-adjusted fMSR score is available in Section S2).
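The bias towards small motifs can be checked numerically. The sketch below is illustrative only and rests on two assumptions that are not confirmed by the text above: the fMSR is discretized with grid averages, and the correction factor is taken to be n/(n-1) for a motif with n portions. Under these assumptions, the average raw score over all sub-motifs of size n grows with n, while the average adjusted score does not depend on n. All function names are hypothetical.

```python
from itertools import combinations
import random


def fmsr(portions):
    # Discretized fMSR with grid averages in place of integrals (assumption).
    n, m = len(portions), len(portions[0])
    row = [sum(p) / m for p in portions]
    col = [sum(p[k] for p in portions) / n for k in range(m)]
    grand = sum(row) / n
    return sum((p[k] - row[j] - col[k] + grand) ** 2
               for j, p in enumerate(portions) for k in range(m)) / (n * m)


def adjusted_fmsr(portions):
    # Assumed correction factor n / (n - 1); requires at least 2 portions.
    n = len(portions)
    return fmsr(portions) * n / (n - 1)


def average_over_submotifs(portions, n, score):
    # Average of `score` over all sub-motifs made of exactly n portions.
    scores = [score(list(sub)) for sub in combinations(portions, n)]
    return sum(scores) / len(scores)


random.seed(0)
# A noisy candidate motif: 5 portions on a grid of 6 points.
motif = [[random.gauss(0.0, 1.0) for _ in range(6)] for _ in range(5)]
raw = [average_over_submotifs(motif, n, fmsr) for n in range(2, 6)]
adjusted = [average_over_submotifs(motif, n, adjusted_fmsr) for n in range(2, 6)]
```

Here `raw` increases with the sub-motif size, which is the bias favoring small motifs, whereas the entries of `adjusted` coincide up to rounding.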
An fMSR-based dissimilarity measure
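The adjusted fMSR can be used to construct a dissimilarity measure between two curve portions $q_1(t)$ and $q_2(t)$ in [0, l] as
$$d_{\mathrm{fMSR}}(q_1, q_2) = \mathrm{fMSR}_{\mathrm{adj}}(W), \tag{8}$$
where $W = \{q_1, q_2\}$ is the functional motif composed only of the portions $q_1$ and $q_2$, and $\mathrm{fMSR}_{\mathrm{adj}}$ denotes the adjusted fMSR. According to this definition, $d_{\mathrm{fMSR}}(q_1, q_2) = 0$ if and only if W is an ideal motif.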
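For two portions the dissimilarity admits a simple closed form, sketched below under stated assumptions: grid averages replace integrals, and the adjustment factor is assumed to be n/(n-1), i.e. 2 for a pair. Under these assumptions the adjusted fMSR of the pair reduces to half the mean squared difference between the two mean-centered portions. The function name `fmsr_dissimilarity` is hypothetical.

```python
def fmsr_dissimilarity(p, q):
    """fMSR-based dissimilarity between two portions on a common grid.

    Closed form for a two-portion motif, assuming a discretized fMSR and an
    adjustment factor of 2 (both are assumptions of this sketch).
    """
    m = len(p)
    if len(q) != m:
        raise ValueError("portions must have the same number of grid points")
    p_mean = sum(p) / m
    q_mean = sum(q) / m
    # Compare the portions after removing their own mean level.
    return sum(((a - p_mean) - (b - q_mean)) ** 2 for a, b in zip(p, q)) / (2 * m)
```

Because each portion is centered on its own mean, two portions that differ only by a vertical shift have dissimilarity zero, in line with the notion of an ideal motif.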
The funBIalign algorithm
Given a set of real-valued curves defined on , , funBIalign discovers recurrent and coherent motifs as defined by (1). The algorithm considers the evaluation of the curves over a grid of equally spaced points and requires as input the discretized motif length (the number of grid points corresponding to the length l in (1)) and the minimum number of portions . It comprises four steps, as described below and presented in the schematic of Fig. 2.
Fig. 2.
A summary schematic of the three main steps of the funBIalign algorithm
Step 1 — Portion creation and alignment. For every curve , we create all portions of length (starting at , etc.), and align them so that their domains all start at . We indicate the resulting overall set of aligned portions with , , where . For each portion, we keep track of the originating curve i(j) and of the grid points occupied.
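Step 1 amounts to a sliding window over each discretized curve. The following is a minimal sketch assuming each curve is given as a list of values on the grid; the function name `create_portions` and the dictionary layout are hypothetical choices made for illustration.

```python
def create_portions(curves, length):
    """Create all overlapping portions of `length` grid points from each curve.

    Each portion records the index of its originating curve and its starting
    grid position, so that the occupied grid points can be recovered later.
    """
    portions = []
    for i, curve in enumerate(curves):
        # Curves shorter than `length` contribute no portions.
        for start in range(len(curve) - length + 1):
            portions.append({
                "curve": i,
                "start": start,
                "values": curve[start:start + length],
            })
    return portions


curves = [[0, 1, 2, 3, 4], [10, 11, 12, 13]]
portions = create_portions(curves, 3)
```

A curve with m grid points yields m - length + 1 portions, so the two curves above contribute 3 and 2 portions respectively.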
Step 2 — Hierarchical clustering based on the fMSR dissimilarity. We compute the fMSR-based dissimilarity of every portion pair , , and calculate , i.e. the dissimilarity of the most dissimilar portion pair. Note that since all portions are evaluated over a grid, the integrals involved in computing dissimilarities are approximated by sums. Whenever two portions originate from the same curve and share at least of their grid points, we name them “acolytes” and artificially increase their dissimilarity by M, to prevent obvious similarities among curve portions with large overlaps from dominating the agglomeration. Our dissimilarities are thus
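$$\tilde{d}_{\mathrm{fMSR}}(p_j, p_k) = \begin{cases} d_{\mathrm{fMSR}}(p_j, p_k) + M & \text{if the portions } p_j \text{ and } p_k \text{ are acolytes,} \\ d_{\mathrm{fMSR}}(p_j, p_k) & \text{otherwise,} \end{cases} \tag{9}$$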
and we use their matrix to perform agglomerative hierarchical clustering with complete linkage (see, e.g., Murtagh and Contreras 2012). Let Tree indicate the resulting dendrogram, whose nodes represent clusters of portions. Due to the use of complete linkage and to the addition of M, the longest dendrogram branches occur when two nodes comprising acolytes are merged (see Fig. 2 Step 2). We use such branches to cut the dendrogram, generating sub-trees , , which do not contain acolytes.
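Step 2 can be sketched in a few lines. This is a naive illustration under stated assumptions, not the authors' implementation: the pairwise dissimilarity uses the two-portion closed form with an assumed adjustment factor of 2, the overlap threshold `min_overlap` is a hypothetical parameter (the value used in the paper is not reproduced here), and the complete-linkage routine is a simple cubic-time version meant for small examples.

```python
def fmsr_dissimilarity(p, q):
    # Two-portion adjusted fMSR in closed form (assumed factor 2).
    m = len(p)
    pm, qm = sum(p) / m, sum(q) / m
    return sum(((a - pm) - (b - qm)) ** 2 for a, b in zip(p, q)) / (2 * m)


def are_acolytes(a, b, length, min_overlap):
    # Same curve and a large enough share of common grid points.
    if a["curve"] != b["curve"]:
        return False
    shared = max(0, length - abs(a["start"] - b["start"]))
    return shared >= min_overlap * length


def penalized_matrix(portions, length, min_overlap=0.5):
    n = len(portions)
    d = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for k in range(j + 1, n):
            d[j][k] = d[k][j] = fmsr_dissimilarity(
                portions[j]["values"], portions[k]["values"])
    big_m = max(max(row) for row in d)  # most dissimilar pair
    for j in range(n):
        for k in range(j + 1, n):
            if are_acolytes(portions[j], portions[k], length, min_overlap):
                d[j][k] = d[k][j] = d[j][k] + big_m
    return d, big_m


def complete_linkage(d):
    # Naive agglomeration; returns (merge height, members) for each merge.
    clusters = {i: [i] for i in range(len(d))}
    merges = []
    while len(clusters) > 1:
        best = None
        keys = sorted(clusters)
        for pos, a in enumerate(keys):
            for b in keys[pos + 1:]:
                height = max(d[x][y] for x in clusters[a] for y in clusters[b])
                if best is None or height < best[0]:
                    best = (height, a, b)
        height, a, b = best
        clusters[a] = clusters[a] + clusters.pop(b)
        merges.append((height, sorted(clusters[a])))
    return merges


portions = [
    {"curve": 0, "start": 0, "values": [0, 1, 2]},
    {"curve": 0, "start": 1, "values": [1, 2, 3]},  # overlaps the first portion
    {"curve": 1, "start": 0, "values": [5, 6, 7]},
    {"curve": 1, "start": 2, "values": [3, 1, 3]},
]
d, big_m = penalized_matrix(portions, length=3)
merges = complete_linkage(d)
```

In this toy example the first two portions have identical shape but are acolytes, so they are kept apart until a merge of height at least M, while the first and third portions, which come from different curves, merge at height zero.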
Step 3 — Collection of candidate functional motifs. For every sub-tree , , we consider the sets of all nodes and all leaf nodes (), and for every we consider the sets of all descendants des(x) and ascendants asc(x). We define the seeds of as the set of nodes with at least leaf descendants, and with no descendant meeting the same criterion; in symbols
| 10 |
Note that a sub-tree can have zero, one or multiple seeds. Finally, we define the family of as the union of v and all its ascendants that are not shared with other seeds:
| 11 |
Next, we select a “recommended representative” for the family. If , this is trivially the one family member. If , we sort the nodes in order of increasing cardinality |x| (i.e. number of portions) and consider (adjusted fMSR value computed on the portions). If increases with rank(|x|), we use an “elbow” approach and select the node just before the maximum increase. Otherwise, we select the node with minimum . Our collection of candidate motifs is composed of recommended representatives of all families from all sub-trees.
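The selection of the recommended representative can be sketched as follows. This is an illustration under assumptions: each family member is summarized by a hypothetical `(cardinality, adjusted_fmsr)` pair, "increases" is read as strictly increasing, and ties in the maximum increase are resolved in favor of the smallest node.

```python
def recommended_representative(family):
    """Pick one node from a family of nested candidate motifs.

    `family` is a list of (cardinality, adjusted_fmsr) pairs, one per node.
    """
    if len(family) == 1:
        return family[0]
    # Sort nodes by increasing number of portions.
    nodes = sorted(family, key=lambda node: node[0])
    scores = [score for _, score in nodes]
    increases = [b - a for a, b in zip(scores, scores[1:])]
    if all(inc > 0 for inc in increases):
        # Elbow: the node just before the largest jump in adjusted fMSR.
        jump = increases.index(max(increases))
        return nodes[jump]
    # Otherwise fall back to the node with minimum adjusted fMSR.
    return min(nodes, key=lambda node: node[1])


monotone = [(3, 0.10), (4, 0.12), (6, 0.50), (9, 0.55)]
non_monotone = [(3, 0.30), (4, 0.10), (5, 0.20)]
```

For `monotone` the largest jump occurs between the nodes with 4 and 6 portions, so the node with 4 portions is selected; for `non_monotone` the node with the smallest score is returned.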
Step 4 — Post-processing. We sort candidate motifs based either on adjusted fMSR or on a combination of adjusted fMSR and inverse cardinality, the rank sum (alternative ranking criteria can be used, see Section 6). Starting from the top, we compare each motif x to those with higher rank. If all portions of x are acolytes of portions of a higher-ranking motif, we filter it out; otherwise we retain it. This produces the final collection of discovered motifs.
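The post-processing step can be sketched as a ranking followed by a greedy filter. Several choices below are assumptions made for illustration: the rank sum is taken as the sum of the rank by increasing adjusted fMSR and the rank by decreasing cardinality, each motif is compared only with higher-ranking motifs that were retained, portions are represented as hypothetical `(curve, start)` pairs, and `min_overlap` is a hypothetical overlap threshold.

```python
def are_acolytes(a, b, length, min_overlap):
    # Portions are (curve, start) pairs; identical portions count as acolytes.
    if a[0] != b[0]:
        return False
    shared = max(0, length - abs(a[1] - b[1]))
    return shared >= min_overlap * length


def rank_sum_order(motifs):
    # Sum of two ranks: low adjusted fMSR first, large cardinality first.
    idx = range(len(motifs))
    by_score = sorted(idx, key=lambda i: motifs[i]["score"])
    by_size = sorted(idx, key=lambda i: -len(motifs[i]["portions"]))
    rank = {i: 0 for i in idx}
    for r, i in enumerate(by_score):
        rank[i] += r
    for r, i in enumerate(by_size):
        rank[i] += r
    return sorted(idx, key=lambda i: rank[i])


def filter_motifs(motifs, order, length, min_overlap=0.5):
    kept = []
    for i in order:
        # Redundant if every portion is an acolyte of some portion of one
        # retained, higher-ranking motif.
        redundant = any(
            all(any(are_acolytes(p, q, length, min_overlap)
                    for q in motifs[k]["portions"])
                for p in motifs[i]["portions"])
            for k in kept)
        if not redundant:
            kept.append(i)
    return kept


motifs = [
    {"score": 0.1, "portions": [(0, 0), (0, 50), (0, 100)]},
    {"score": 0.2, "portions": [(0, 2), (0, 52), (0, 101)]},  # shifted copy
    {"score": 0.3, "portions": [(0, 200), (0, 300)]},
]
order = rank_sum_order(motifs)
kept = filter_motifs(motifs, order, length=10)
```

In this example the second motif is a slightly shifted copy of the first and is filtered out, while the third motif occupies different regions of the curve and is retained.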
We remark that Step 2 relies on M and the definition of acolytes. While larger values of M do not affect the solution, the way we define acolytes (e.g., the percent overlap) can have an impact on the results; this definition may be changed depending on applications and user needs. In addition, the dynamic cut in Step 3 selects candidate motifs controlling a trade off between cohesiveness (small adjusted fMSR) and prevalence (large number of occurrences). In contrast, the post-processing in Step 4 eliminates overlapping candidates which might have been selected in different sub-trees. We note that Step 4 actually provides an “importance ranking” which can be used, more generally, to further explore and select among candidates. This is particularly useful in applications where the algorithm identifies a very large number of candidate motifs.
Simulations
We assess the performance of funBIalign through an extensive simulation study. To simulate a smooth curve embedding multiple occurrences of functional motifs we use the flexible B-spline-based model proposed by Cremona and Chiaromonte (2023). Briefly, this model generates a smooth curve as where is a B-spline basis of order with equally spaced knots and , are real coefficients. In every simulation conducted in our study, we generate a single smooth curve of length 7000 using order and knots at distance . The curve is then evaluated across a grid of 7001 equally spaced points (). We then randomly embed in the curve the same number (8 or 10) of occurrences of 4 distinct motifs of length (i.e. , points). Coefficients defining both the curve and the motifs are randomly generated from a Beta(0.45, 0.45) and then rescaled to . We incorporate into every motif occurrence a vertical shift drawn uniformly from as well as noise – adding to the coefficients independent draws from . We note that this way of generating motifs within a curve does not match the additive model in (1); thus, in our simulations we are challenging funBIalign with motifs that may be harder for it to identify, showcasing the flexibility of the algorithm. We also note that the curve background can, by chance, comprise segments that resemble one of the motifs; that is, extra portions that were not intentionally embedded but do follow the motif pattern. Moreover, the background may reveal entirely different and distinguishable motifs; that is, patterns that, while not intentionally inserted in multiple occurrences, happen to repeat themselves along the curve. Extra portions or additional motifs emerging from the background introduce an added complexity in interpreting simulation results (see below). Because of this complexity, we prefer to refer to motifs and occurrences as “embedded” (vs. not embedded) instead of “true” (vs. false).
We construct a total of 100 simulations. We consider 10 alternative motifs sets, each one with 4 distinct motifs; 2 alternative numbers of occurrences (8 or 10) which, for simplicity, are the same for all motifs; and 4 alternative levels of noise, expressed by 0.1, 0.5, 1 or 2. In a first batch of simulations, all motifs share the same , and we use all values in turn. In a second batch of simulations, each motif is attributed a different . We run funBIalign setting (the true length of the motifs) and using different minimum cardinalities (progressively closer to the true number of occurrences, 8 or 10). We post-process the candidate motifs produced in each run, ranking them according to two criteria – the adjusted fMSR and the rank sum – and identifying results which are most similar to the intentionally embedded motifs (the ones we use as targets). Here we focus on those; that is, on the detection of occurrences of the motifs we embedded in the curve, plus potential extra portions of the background that the algorithm associates with such motifs. funBIalign may “discover” in the background motifs other than those we created, but we will ignore them in the discussion to follow. We also restrict the main text presentation to the 40 simulations where motifs have 8 occurrences and shared noise levels. Results for simulations where motifs have 10 occurrences and shared noise levels, or where motifs have 8 or 10 occurrences and different noise levels, are entirely consistent with those presented here and are provided in Sections S4 and S5.
Fig. 3A summarizes performance for the 40 simulations, pooling results across algorithm runs with different ’s. For every level of we display two boxplots, each comprising values: for each of the 4 motifs in each of the 10 sets and across the 4 runs, we count the number of portions correctly identified (left boxplot) and the number of extra portions (right boxplot). We can see that funBIalign is quite effective in identifying embedded motif occurrences. However, as expected, as the level of noise increases some embedded portions are missed, and some extra portions are found – though these are usually very similar to the embedded portions (see Fig. 5). Similar results, again pooling across runs with different ’s, but separating each of the 10 alternative motif sets, are provided in Fig. S4. Fig. 3C shows results (identification of true occurrences and of extra portions) for one motif set, n. 7, but separately for the 4 runs with varying . The minimum cardinality used in the algorithm can indeed impact performance, especially through the ranking of the candidate motifs (see Figs. S5-S6). When is much lower than the true number of occurrences, rank sum is preferable to adjusted fMSR as a ranking criterion, because it tends to privilege results more similar to the embedded motifs. On the other hand, when is close to the true number of occurrences, best results do not necessarily have highest rank sums. This fact can hinder their identification. In addition, as expected, when is too low we miss some occurrences and, as gets higher, we identify more extra portions. Fig. 3B shows the first half of the curve for the simulation using motif set n. 7, with noise level , color-coding motif occurrences interspersed across the curve. Fig. 4 provides details on the motifs identified in the simulation with noise level , and Fig. 5 on those corresponding to , running funBIalign with in both cases. 
Performance is excellent, though noisier motifs, as expected, cause a slight deterioration. We also see how, at the same level of noise, motifs may be easier or harder to identify depending on their shapes. Importantly, we note that when funBIalign identifies extra portions, i.e. segments of the background that resemble a motif by chance, these are indeed very similar to the occurrences intentionally embedded in the curve (Fig. 5). In effect, they should be thought of as “unplanned” true positives, not as false positives.
Fig. 3.
A Performance of funBIalign for all simulations where motifs have 8 occurrences and shared noise levels. The algorithm is run with different minimum cardinalities and results are pooled. For each noise level, the green and pink boxplots (left and right) represent correctly identified portions (the ones which were embedded in the simulation, and correctly identified by funBIalign) and extra portions (the ones which were not embedded in the simulation, but still identified by funBIalign), respectively. B First half of the curve for a simulation employing motif set n. 7 and . Occurrences of the motifs are color-coded. The ellipsis (...) indicates that the curve continues. C Performance of funBIalign for simulations employing motif set n. 7, shown separately for runs with varying minimum cardinality. Green and pink jittered dots represent correctly identified portions and extra portions, respectively
Fig. 5.
Motif identification for the simulation employing motif set n. 7 and . funBIalign is run with . See legend for Fig. 4. With high noise, 2 embedded occurrences are missed, and 4 extra occurrences are identified (represented in gray among the identified portions panels)
Fig. 4.
Motif identification for the simulation employing motif set n. 7 and . funBIalign is run with . For each of the 4 motifs, color-coded in green, red, blue and yellow, left and right panels show the 8 occurrences embedded in the curve and the most similar portions identified by the algorithm, respectively. The table provides rankings and numbers of correctly identified, extra and missing occurrences for each motif. With low noise, no embedded occurrences are missed, and no extra occurrences are identified
In our simulation study we treat the motif length as a known quantity, using its true value. However, in many real-data applications a reasonable value for the motif length may not be known in advance. In such settings, provided the user can at least identify a reasonable range for it, we suggest running the algorithm for each length in such range and computing the average adjusted fMSR score of the top motifs. An elbow method strategy can then be employed to select a satisfactory value within the explored range. Figure S2 shows the results of this procedure on the simulations employing motif set n. 7.
We studied how the choice of could impact the correct and complete identification of motifs. We simulated four versions of a 7001 point long curve embedding 20 instances of the same motif, using four levels of noise; , and 2. We then verified the correct identification of the motif with values of that differ markedly from the actual number of embedded portions; namely, 4. As shown in Figure S3, a too low can result in incomplete identification, especially for higher levels of noise. However, in general, the identification is complete or almost complete even if . Conversely, a too high forces the algorithm to identify extra portions – in addition to the embedded 20, which are all identified. This highlights the importance of a reasonable choice of , but also shows that the proposed method has some tolerance towards non-trivial misspecifications.
Comparisons with related methods
We compare funBIalign with two different methods; namely, probKMA (Cremona and Chiaromonte 2023), a functional motif discovery algorithm employing probabilistic K-means with local alignment, and a motif discovery method based on SCRIMP-MP (Zhu et al. 2018), one of the most recent Matrix Profile (MP) algorithms for motif discovery in univariate time series. We focus on the ability of the three methods to correctly identify embedded motifs. Concerning computational burden, we restrict ourselves to a targeted comparison between funBIalign and probKMA, the other functional-native method (see below; a rigorous, more comprehensive comparison is hard to perform due to differences in coding languages, degree of code optimization, and overall structure of the pipelines, e.g., number and nature of algorithmic and post-processing parameters to be fixed or tuned).
funBIalign and probKMA differ in various respects. funBIalign employs agglomerative clustering with adjusted fMSR, which generates a complete hierarchy of all curve portions. In contrast, probKMA relies on local functional K-means with Sobolev distance. Moreover, while probKMA can extend the length of motifs endogenously and discover motifs of varying and unknown sizes, starting from a user-defined set of minimum motif lengths, funBIalign requires the length of all motifs to be the same and fixed beforehand. However, being based on K-means, probKMA must be run several times with different initializations, which can add to the computational cost. In addition, probKMA implements a more complex post-processing, involving several tuning parameters, and this could make the method less attractive for non-specialized users. Finally, funBIalign has the benefit of being equally applicable to a single curve or to sets of curves, whereas probKMA is designed to operate on sets of curves; to use it in applications involving a single curve, the curve must be split at the outset, which requires further arbitrary choices and may lead to the loss of interesting motifs. We compare the performance of the two algorithms using two simulation settings for scenario (1) introduced in Section 4.2 of Cremona and Chiaromonte (2023). In both, two motifs of length 61, say A and B, occur each 12 times across 20 curves. In particular, 12 curves contain 1 occurrence of a single motif (of A for 6, and of B for 6 curves), 4 curves contain 2 occurrences of a single motif (of A for 2, and of B for 2 curves), 2 curves contain 1 occurrence of both motifs, and 2 curves contain no motif. The two settings differ in terms of length of the curves L and of noise incorporated in the motifs; one has short curves and low noise ( and ), and the other long curves and high noise ( and ).
We run funBIalign with (the true length of the motifs) and , and post-process candidate motifs with the rank sum criterion. probKMA is run using , minimum motif length , and 20 random initializations for each (K, v) pair; results from the 120 runs with different parameters/initializations are pooled following the motif discovery post-processing recommended in Cremona and Chiaromonte (2023). The whole procedure is repeated 10 times, and medians are taken over these 10 repetitions when evaluating performance. Results are summarized in Table 1. As expected, the choice of has an impact on the performance of funBIalign: smaller values can lead to missed portions, and larger values to extra portions. However, with an appropriate , funBIalign can match and even exceed the performance of probKMA. In terms of computational burden, we compare the current implementations of funBIalign and probKMA in the aforementioned simulated setting with and . Specifically, we run funBIalign with and , and post-process candidate motifs with the rank sum criterion. Conversely, we run probKMA using , minimum motif length , no motif elongation, and 10 random initializations for each (K, v) pair. Both methods are run 10 times on a local machine (64GB of memory, 8 performance cores and 2 efficiency cores). On average, the current implementation of funBIalign is faster than probKMA: while our method requires 1.69 seconds to run sequentially and 1.29 seconds to run in parallel on 10 cores, probKMA runs in 69.96 and 24.25 seconds, respectively.
Table 1.
Comparison between funBIalign and probKMA
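| Setting | Motif | Portion | probKMA | funBIalign (1st minimum cardinality) | funBIalign (2nd minimum cardinality) |
|---|---|---|---|---|---|
| Short curves, low noise | Motif A | Correct | 12 | 12 | 12 |
| Short curves, low noise | Motif A | Extra | 0 | 0 | 0 |
| Short curves, low noise | Motif B | Correct | 12 | 12 | 12 |
| Short curves, low noise | Motif B | Extra | 0 | 0 | 0 |
| Long curves, high noise | Motif A | Correct | 11 | 12 | 12 |
| Long curves, high noise | Motif A | Extra | 2 | 1 | 1 |
| Long curves, high noise | Motif B | Correct | 12 | 10 | 12 |
| Long curves, high noise | Motif B | Extra | 1 | 0 | 3 |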
For probKMA, median results across 10 runs are reported
We next consider the motif discovery method based on SCRIMP-MP as implemented in the R library tsmp. This method employs a k Nearest Neighbours (kNN) routine and a similarity measure based on the z-normalized Euclidean distance, requiring the user to fix several parameters ex ante; namely, the number of motifs to be discovered u, their length , a motif radius R (i.e., the distance within which two portions of the time series are identified as belonging to the same motif), and a maximum number of neighbors to consider . Specifically, the algorithm starts from pairs of most similar portions, to which other portions of the curve are added only if they are within a distance R from the starting pair, and only up to a maximum of . In contrast to probKMA, this method is designed to operate with a single time series; to use it in applications involving a set of curves, those need to be joined, which can be problematic. Another MP-based method called Ostinato (Kamgar et al. 2019) can find “consensus” motifs shared by some curves in a set; however, the definition of “consensus” motif differs from that of functional motif since it considers only the best matching occurrence in each curve.

We compare the performance of funBIalign and the SCRIMP-MP based method using two curves from our own simulation study in Section 4, namely the ones employing motifs set n. 7 with 8 occurrences each, with shared noise level and , respectively (see Figures 4 and 5). We run funBIalign with (the true length of the motifs) and . We use the SCRIMP-MP method, as implemented in the R library tsmp, to identify a maximum of motifs with . All possible combinations of and are tested. Table 2 summarizes the results most similar to the embedded motifs. Also here the choice of has an impact on the performance of funBIalign, but R and have a yet stronger impact on the performance of SCRIMP-MP, which struggles more, especially with higher noise. In addition, tuning these two parameters is not trivial, because intuition and knowledge about them are unlikely to be available a priori. A similar comparison using motifs with different noise levels is presented in Section S6.
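To make the roles of the radius R and of the maximum number of neighbors concrete, here is a minimal brute-force sketch of the procedure as described above: find the closest pair of non-overlapping portions under the z-normalized Euclidean distance, then add further portions that lie within R of both members of the pair, up to the neighbor limit. This is not the SCRIMP-MP algorithm or the tsmp implementation; the function names, the "within R of both pair members" rule, and the handling of overlaps are assumptions made for illustration.

```python
import math


def znorm(x):
    """Z-normalize a portion (constant portions map to all zeros)."""
    mean = sum(x) / len(x)
    sd = math.sqrt(sum((v - mean) ** 2 for v in x) / len(x))
    if sd == 0:
        return [0.0] * len(x)
    return [(v - mean) / sd for v in x]


def zdist(a, b):
    """Z-normalized Euclidean distance between two portions of equal length."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(znorm(a), znorm(b))))


def find_motif(series, length, radius, max_neighbors):
    """Return sorted start indices of one motif (hypothetical helper)."""
    n_windows = len(series) - length + 1
    windows = [series[s:s + length] for s in range(n_windows)]

    # Step 1: closest pair of non-overlapping portions (brute force).
    best = None
    for i in range(n_windows):
        for j in range(i + length, n_windows):
            d = zdist(windows[i], windows[j])
            if best is None or d < best[0]:
                best = (d, i, j)
    if best is None:
        return []
    _, i, j = best
    members = [i, j]

    # Step 2: add portions within `radius` of both pair members,
    # closest first, skipping overlaps, up to `max_neighbors`.
    candidates = sorted(
        (max(zdist(windows[k], windows[i]), zdist(windows[k], windows[j])), k)
        for k in range(n_windows)
    )
    added = 0
    for d, k in candidates:
        if added == max_neighbors or d > radius:
            break
        if any(abs(k - m) < length for m in members):
            continue
        members.append(k)
        added += 1
    return sorted(members)
```

Under this sketch a motif can never contain more than two plus the neighbor limit portions, which is one way to see why a small limit caps the number of occurrences that can be recovered, while a large R admits extra portions.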
Table 2.
Comparison between funBIalign and SCRIMP-MP
| Motif | funBIalign - | MP | R | |||||||
|---|---|---|---|---|---|---|---|---|---|---|
| 5 | 6 | 7 | 8 | 3 | 10 | 25 | 50 | |||
| Motif 1 | (8,0) | (8,0) | (8,0) | (8,0) | 4 | (6,0) | (6,0) | (6,0) | (6,0) | |
| 6 | (6,0) | (8,0) | (8,0) | (8,0) | ||||||
| 8 | (7,3) | (8,0) | (8,2) | (8,2) | ||||||
| Motif 2 | (8,0) | (8,0) | (8,0) | (8,0) | 4 | (6,0) | (6,0) | (6,0) | (6,0) | |
| 6 | (8,0) | (8,0) | (8,0) | (8,0) | ||||||
| 8 | (8,0) | (8,0) | (8,2) | (8,2) | ||||||
| Motif 3 | (6,0) | (8,0) | (8,0) | (8,0) | 4 | (6,0) | (6,0) | (6,0) | (6,0) | |
| 6 | (8,0) | (8,0) | (8,0) | (8,0) | ||||||
| 8 | (8,0) | (8,0) | (8,2) | (8,2) | ||||||
| Motif 4 | (8,0) | (8,0) | (8,0) | (8,0) | 4 | (5,0) | (6,0) | (6,0) | (6,0) | |
| 6 | (5,0) | (8,0) | (8,0) | (8,0) | ||||||
| 8 | (5,0) | (8,1) | (8,2) | (8,2) | ||||||
| Motif 1 | (7,0) | (8,1) | (8,1) | (8,1) | 4 | (3,0) | (6,0) | (6,0) | (6,0) | |
| 6 | (3,0) | (8,0) | (8,0) | (8,0) | ||||||
| 8 | (3,0) | (8,2) | (8,2) | (8,2) | ||||||
| Motif 2 | (7,2) | (7,2) | (8,1) | (8,1) | 4 | (5,1) | (4,2) | (4,2) | (4,2) | |
| 6 | (6,2) | (5,3) | (5,3) | (5,3) | ||||||
| 8 | (7,3) | (6,4) | (6,4) | (6,4) | ||||||
| Motif 3 | (6,1) | (8,0) | (8,0) | (8,0) | 4 | (5,1) | (5,1) | (5,1) | (5,1) | |
| 6 | (5,1) | (6,2) | (6,2) | (6,2) | ||||||
| 8 | (5,1) | (7,3) | (7,3) | (7,3) | ||||||
| Motif 4 | (6,1) | (7,1) | (7,1) | (7,1) | 4 | (5,1) | (5,1) | (5,1) | (5,1) | |
| 6 | (5,3) | (5,3) | (5,3) | (5,3) | ||||||
| 8 | (6,4) | (7,3) | (7,3) | (7,3) | ||||||
Cases in which all occurrences are correctly identified without the addition of any extra portions are in bold
Case studies
In this section, we assess the ability of funBIalign to discover functional motifs in real data sets through two case studies. The first concerns food price inflation over a period of around 20 years, and the second temperature changes over a period of around 60 years. In both cases we utilize monthly measurements across the world provided by FAO (see Footnotes).
Case study 1: food price inflation
The FAOSTAT data used in this case study comprise monthly food price inflation measurements from January 2001 to June 2022 (a total of 258 measurements) for different countries and geographical regions (details are available in the repository metadata section, and geographical regions, as defined by the United Nations, are in Section S7). We first seek motifs in a single, world-wide food price inflation curve, and then seek motifs in the curves for 19 distinct geographical regions. Before running funBIalign, we smooth the data using local polynomials with Gaussian kernels (locpoly function of the R package KernSmooth, Wand and Ripley 2006); a bandwidth parameter equal to 1.5 seems appropriate to avoid over-smoothing possibly interesting peaks in the data.
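For readers unfamiliar with local polynomial smoothing, the following is a minimal sketch of a local linear fit with Gaussian kernel weights, the kind of estimator that locpoly approximates. It is not the KernSmooth implementation (which uses binning for speed); the polynomial degree of 1 is an assumption, since the text does not state the degree, and the function name and toy data are illustrative.

```python
import math


def local_linear_smooth(x, y, bandwidth, grid):
    """Local linear regression with Gaussian kernel weights.

    At each grid point x0, a weighted least-squares line is fitted to (x, y)
    with weights exp(-0.5 * ((x - x0) / bandwidth)^2); the fitted value at x0
    is the intercept of that line.
    """
    fitted = []
    for x0 in grid:
        w = [math.exp(-0.5 * ((xi - x0) / bandwidth) ** 2) for xi in x]
        s0 = sum(w)
        s1 = sum(wi * (xi - x0) for wi, xi in zip(w, x))
        s2 = sum(wi * (xi - x0) ** 2 for wi, xi in zip(w, x))
        t0 = sum(wi * yi for wi, yi in zip(w, y))
        t1 = sum(wi * (xi - x0) * yi for wi, xi, yi in zip(w, x, y))
        fitted.append((s2 * t0 - s1 * t1) / (s0 * s2 - s1 ** 2))
    return fitted


# Toy monthly series (made-up values) smoothed with the bandwidth used in the text.
months = [float(t) for t in range(24)]
values = [math.sin(t / 3.0) + (0.3 if t % 5 == 0 else 0.0) for t in months]
smoothed = local_linear_smooth(months, values, bandwidth=1.5, grid=months)
```

A small bandwidth such as 1.5 months gives non-negligible weight only to a handful of neighboring observations, which is why short-lived peaks survive the smoothing.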
For the world-wide curve, we seek annual patterns setting (months), and fix the minimum cardinality to ; a small is appropriate since we are considering a total of only 258 measurements. funBIalign identifies 30 motifs capturing various types of shapes. Two, depicting “valleys” and “peaks”, are displayed in Fig. 6A. Notably, the “peak” motif has occurrences corresponding to well-known economic crises: the 2001 recession, the global financial crisis of 2007-2008, and the 2020 COVID-19 recession. For the 19 regional curves, which comprise overall many more measurements, we seek longer, biannual patterns setting (months) and fix a larger minimum cardinality . Here funBIalign identifies 415 motifs. To sieve through such a large output, we suggest that users post-process and rank results with a criterion suited to their goal. For instance, if one is interested in cohesive results regardless of how frequently a motif recurs, the adjusted fMSR criterion is the best choice. In contrast, if cardinality is important, it may be preferable to utilize the rank sum criterion. We note that, due to the definition of fMSR, motifs with lower variance (which look nearly constant) tend to rank higher, a fact that could overshadow some patterns. For instance, among the 415 motifs identified in the 19 regional curves, some rather interesting high variance motifs are at the bottom of the fMSR ranking. Two motifs are shown in Fig. 6B. One, occurring 6 times, is the top-ranking in fMSR; it depicts a “mild ascent”, with low variance. The other, occurring 9 times, is the top-ranking in terms of variance; it depicts a “peak followed by a valley” and ranks only in fMSR.
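The difference between ranking by adjusted fMSR alone and ranking by rank sum can be illustrated on a toy list of candidates. The sketch assumes that the rank sum adds a motif's rank by adjusted fMSR (lower is better) to its rank by number of occurrences (higher is better); the motif names and scores below are made up, and ties are not treated specially.

```python
def rank(values, reverse=False):
    """Rank positions starting at 1; the smallest value gets rank 1 unless reverse=True."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=reverse)
    ranks = [0] * len(values)
    for position, i in enumerate(order, start=1):
        ranks[i] = position
    return ranks


# Hypothetical candidate motifs (values invented for illustration).
motifs = [
    {"name": "m1", "adj_fmsr": 0.02, "occurrences": 3},
    {"name": "m2", "adj_fmsr": 0.03, "occurrences": 9},
    {"name": "m3", "adj_fmsr": 0.05, "occurrences": 4},
]

fmsr_rank = rank([m["adj_fmsr"] for m in motifs])                   # cohesion
card_rank = rank([m["occurrences"] for m in motifs], reverse=True)  # cardinality
rank_sum = [f + c for f, c in zip(fmsr_rank, card_rank)]

by_fmsr = [motifs[i]["name"] for i in sorted(range(len(motifs)), key=lambda i: fmsr_rank[i])]
by_rank_sum = [motifs[i]["name"] for i in sorted(range(len(motifs)), key=lambda i: rank_sum[i])]

print(by_fmsr)      # prints: ['m1', 'm2', 'm3']
print(by_rank_sum)  # prints: ['m2', 'm1', 'm3']
```

In this toy example the most cohesive motif (m1) leads the fMSR ranking, while the frequently recurring m2 overtakes it once cardinality is taken into account.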
Fig. 6.
Two of the motifs identified by funBIalign in the world-wide food price inflation curve (panel A) and two of the motifs identified in the curves for 19 geographical regions (panel B). In both panels, we show the top-ranking adjusted fMSR motif (bottom left plot) along with a high-ranking motif in terms of variance (bottom right plot) – which, due to the criterion definition, may rank relatively low in terms of fMSR. In the top plot, N, S, E, W, and C stand for Northern, Southern, Eastern, Western, and Central, respectively
We end this section remarking on the very steep ascent in food price inflation at the end of the time domain covered by the data (2021 and first half of 2022). This can be noticed in the world-wide curve (Fig. 6A), and it appears also in some of the 19 regional curves (Fig. 6B), in particular that of Western Asia, which rises to and is cut in the plot. Digging deeper, the rise in the Western Asia curve seems itself driven by countries such as Lebanon and the Syrian Arab Republic, which were experiencing conflicts and thus likely additional inflation pressures on top of world-wide COVID-19 related trends.
Case study 2: temperature changes
Data used in this case study comprise monthly measurements of temperature changes with respect to a baseline climatology corresponding to the average temperature in the period 1951-1980 for different countries and geographical regions (further details can be found in the repository metadata section). We focus again on the 19 geographical regions defined by the United Nations, and on a period from 1961 to 2021, for a total of 732 monthly measurements (the measurements for a region are obtained by averaging those for the countries belonging to it). Also in this case study, before running funBIalign, we smooth the data, but this time we use cubic smoothing B-splines with knots at each month and a roughness penalty on the second derivative of the curve. The smoothing parameter is selected by minimizing the average generalized cross-validation error across curves.
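The regional aggregation and the length of the series can be sketched in a few lines. The averaging below is an unweighted mean over countries with complete data, which is an assumption (the text does not say whether weights or missing values are involved); the country labels and values are made up.

```python
def region_average(country_series):
    """Pointwise unweighted mean of equally long monthly series, one per country."""
    series = list(country_series.values())
    n_months = len(series[0])
    return [sum(s[t] for s in series) / len(series) for t in range(n_months)]


# Number of monthly measurements from January 1961 to December 2021.
n_months = (2021 - 1961 + 1) * 12
print(n_months)  # prints: 732

# Toy region with two hypothetical countries and three months of temperature changes.
toy_region = {"country_a": [0.1, 0.4, -0.2], "country_b": [0.3, 0.0, 0.2]}
regional_curve = region_average(toy_region)
```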
In this analysis we seek long, 10-year patterns setting (months). Considering the size of the motifs relative to the time period covered by the data, we fix again a small minimum cardinality . Also here funBIalign identifies a very large number of motifs: 2367. We post-process and rank them using three different approaches: adjusted fMSR, rank sum, and variance. Top-ranking motifs based on each and the 19 curves are shown in Fig. 7. Notably, a motif can occur simultaneously in multiple regions, and then separately in others; e.g., the top variance motif, with peaks as much as 4 °C above baseline, characterizes East Europe and North Africa from the mid ’80s to the mid ’90s, and then Central Asia almost 30 years later. Also notably, occurrences of different motifs can overlap; e.g., occurrences of the top fMSR and the top rank sum motifs in East Africa, North Africa and South America in the ’70s present a large overlap, almost extending one another along the time domain. We end by pointing out the increasing upward departures from baseline climatology shown by temperatures in all 19 regions over the 60 years covered by the data.
Fig. 7.
Three motifs identified by funBIalign in the temperature change curves for 19 geographical regions. We show the top-ranking variance (top right – 3 portions), adjusted fMSR (middle right – 3 portions) and rank sum (bottom right – 4 portions) motifs. In the left plot, N, S, E, W, C, and M stand for Northern, Southern, Eastern, Western, Central, and Middle, respectively
Conclusions
We contribute to the recent literature on functional motif discovery with a definition of functional motifs based on an explicit additive model, and with an algorithm designed to discover such motifs: funBIalign. Related to our additive model, we also introduce the adjusted functional Mean Squared Residue (fMSR) score. The fMSR is a functional extension of the MSR score widely used in the multivariate biclustering literature. Building upon our own past work (Di Iorio et al. 2020), we prove it to be biased towards motifs that occur less often in the data, and formulate a de-biasing adjustment. funBIalign is a very flexible multi-step algorithm which requires only two easy-to-interpret input parameters: the length and the minimum number of occurrences of the motifs to be discovered. It uses agglomerative clustering to produce a hierarchy capturing the relationships among curve portions, followed by a dynamic cutting procedure to identify the most interesting candidate motifs based on such hierarchy, and by a post-processing step to eliminate redundant results.
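To give some intuition for the score, the sketch below computes the classical multivariate mean squared residue of Cheng and Church (2000) on curve portions evaluated on a common grid. It is only a discretized stand-in: the functional version works with curves rather than grid values, and the de-biasing adjustment introduced in the paper is not reproduced here. The function name is illustrative.

```python
def mean_squared_residue(portions):
    """Cheng-Church mean squared residue of a set of equally long portions.

    Each portion is a list of values on a common grid. The residue at
    (portion i, grid point t) is value - portion mean - pointwise mean + overall mean.
    """
    n, m = len(portions), len(portions[0])
    row_mean = [sum(p) / m for p in portions]
    col_mean = [sum(p[t] for p in portions) / n for t in range(m)]
    overall = sum(row_mean) / n
    total = 0.0
    for i, p in enumerate(portions):
        for t in range(m):
            residue = p[t] - row_mean[i] - col_mean[t] + overall
            total += residue ** 2
    return total / (n * m)


# Portions that share one shape up to a vertical shift fit the additive model exactly.
shape = [0.0, 1.0, 4.0, 1.0, 0.0]
shifted_copies = [[v + shift for v in shape] for shift in (0.0, 2.5, -1.0)]
print(mean_squared_residue(shifted_copies))  # numerically zero

# Portions with different shapes leave a positive residue.
print(mean_squared_residue([[0.0, 1.0, 0.0], [0.0, -1.0, 0.0]]))
```

The first example shows why the score suits an additive model: vertical shifts between occurrences are absorbed by the portion means and do not count as lack of fit.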
Notwithstanding the simplistic additive model that underlies the fMSR score, funBIalign shows very good performance both in extensive simulations and in two real-data case studies. However, the selection of could be problematic in some applications, requiring the user to run the algorithm with a range of alternative values for this tuning parameter. On a different front, as noted when discussing our simulation study, funBIalign can identify extra instances of a given motif, or even entirely different motifs, that were not intentionally embedded in the simulated curves. While this is an explainable by-product of the procedure used to create our simulated curves (certain patterns may occur and recur by chance in the curve background), it points to the need for a more rigorous statistical treatment of motif discovery. In particular, in future work, we plan to develop a measure of significance to be used in conjunction with the adjusted fMSR for motif discovery. In addition, we are planning to extend the algorithm so as to allow the identification of motifs invariant to vertical scaling (this would refer to an underlying multiplicative model, instead of the current additive one).
Parsing through all curve portions of a given length allows funBIalign to tackle applications where classical functional alignment fails or may be inadequate, such as identifying motifs embedded consecutively, or appearing only in one curve. However, this comes at a cost; storing the dissimilarity matrix may necessitate a massive amount of memory when analyzing large functional data sets. Nonetheless, we note that this matrix is only calculated once, and in our simulations (each involving one curve encompassing 7001 measurements) and case studies (involving a maximum of 19 curves of 732 measurements) memory did not pose a challenge; in the examples we considered, funBIalign runs in under 1 minute (or even less as shown in the comparison with probKMA) without RAM problems on a local machine (64GB of memory, 8 performance cores and 2 efficiency cores). We do believe that running time could be further improved by leveraging parallelization even more. In future work, we plan to optimize funBIalign to allow it to scale efficiently also on much larger datasets.
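A back-of-the-envelope estimate shows how the memory needed for the dissimilarity matrix grows with the data. The sketch assumes that every sliding portion of the chosen length is a clustering object, that only the lower triangle of the matrix is stored, and that each entry takes 8 bytes; the motif lengths passed below (120 and 100) are hypothetical values chosen for illustration, not settings taken from the text.

```python
def n_portions(n_curves, n_points, length):
    """Number of sliding portions of a given length across equally long curves."""
    return n_curves * (n_points - length + 1)


def dissimilarity_bytes(n_objects, bytes_per_entry=8):
    """Memory for the lower triangle of an n x n dissimilarity matrix."""
    return n_objects * (n_objects - 1) // 2 * bytes_per_entry


# Data sizes from the text; motif lengths are hypothetical.
examples = [
    ("19 curves of 732 points, length 120", 19, 732, 120),
    ("1 curve of 7001 points, length 100", 1, 7001, 100),
]
for label, curves, points, length in examples:
    n = n_portions(curves, points, length)
    gib = dissimilarity_bytes(n) / 2 ** 30
    print(f"{label}: {n} portions, about {gib:.2f} GiB")
```

Because the number of entries grows quadratically in the number of portions, doubling the number of curves roughly quadruples the memory requirement.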
Supplementary information
Supplementary material is available online. Current R code for the algorithm is available in the GitHub repository https://github.com/JacopoDior/funBIalign and an R package is in preparation.
Acknowledgements
J. Di Iorio acknowledges the support of the Penn State Eberly College of Science. M.A. Cremona acknowledges the support of the Natural Sciences and Engineering Research Council of Canada (NSERC), of the Fonds de recherche du Québec Health (FRQS), and of FSA, Université Laval. F. Chiaromonte acknowledges support from the Huck Institutes of the Life Sciences, Penn State.
Author Contributions
All authors conceived ideas and analysis approaches. J.D.I. implemented the methodology and performed analyses. All authors interpreted findings and participated in the writing of the manuscript.
Funding
The authors have no relevant financial or non-financial interests to disclose.
Declarations
Competing interests
The authors declare no competing interests.
Footnotes
1. Data used in these case studies can be downloaded at https://www.fao.org/faostat/en/#data/ET.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- Angiulli, F., Cesario, E., Pizzuti, C.: Random walk biclustering for microarray data. Inf. Sci. 178(6), 1479–1497 (2008)
- Boschi, T., Di Iorio, J., Testa, L., Cremona, M.A., Chiaromonte, F.: Functional data analysis characterizes the shapes of the first COVID-19 epidemic wave in Italy. Sci. Rep. 11, 17054 (2021). 10.1038/s41598-021-95866-y
- Buckner, R.L., Head, D., Parker, J., Fotenos, A.F., Marcus, D., Morris, J.C., Snyder, A.Z.: A unified approach for morphometric and functional data analysis in young, old, and demented adults using automated atlas-based head size normalization: reliability and validation against manual measurement of total intracranial volume. Neuroimage 23(2), 724–738 (2004)
- Chen, D., Cremona, M.A., Qi, Z., Mitra, R.D., Chiaromonte, F., Makova, K.D.: Human L1 transposition dynamics unraveled with functional data analysis. Mol. Biol. Evol. 37, 3576–3600 (2020). 10.1093/molbev/msaa194
- Cheng, Y., Church, G.M.: Biclustering of expression data. In Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology, La Jolla, CA, pp. 93–103 (2000)
- Cremona, M.A., Chiaromonte, F.: Probabilistic K-means with local alignment for clustering and motif discovery in functional data. J. Comput. Graph. Stat. (2023). 10.1080/10618600.2022.2156522
- Das, S., Demirer, R., Gupta, R., Mangisa, S.: The effect of global crises on stock market correlations: evidence from scalar regressions via functional data analysis. Struct. Chang. Econ. Dyn. 50, 132–147 (2019)
- Di Iorio, J., Chiaromonte, F., Cremona, M.A.: On the bias of h-scores for comparing biclusters, and how to correct it. Bioinformatics 36(9), 2955–2957 (2020)
- Di Iorio, J., Vantini, S.: funloci: a local clustering algorithm for functional data. arXiv:2305.12991 (2023)
- Ferraty, F., Vieu, P.: Nonparametric functional data analysis: theory and practice (2006)
- Ferreira, L., Hitchcock, D.B.: A comparison of hierarchical methods for clustering functional data. Commun. Stat.-Simul. Comput. 38(9), 1925–1949 (2009)
- Floriello, D., Vitelli, V.: Sparse clustering of functional data. J. Multivar. Anal. 154, 1–18 (2017)
- Fraiman, R., Gimenez, Y., Svarc, M.: Feature selection for functional data. J. Multivar. Anal. 146, 191–208 (2016)
- Galvani, M., Torti, A., Menafoglio, A., Vantini, S.: Funcc: a new bi-clustering algorithm for functional data with misalignment. Comput. Stat. Data Anal. 160, 107219 (2021)
- Ghumman, A.R., Haider, H., Shafiquzamman, M.: Functional data analysis of models for predicting temperature and precipitation under climate change scenarios. J. Water Clim. Chang. 11(4), 1748–1765 (2020)
- Jacques, J., Preda, C.: Functional data clustering: a survey. Adv. Data Anal. Classif. 8(3), 231–255 (2014)
- Kamgar, K., Gharghabi, S., Keogh, E.: Matrix profile xv: Exploiting time series consensus motifs to find structure in time series sets. In 2019 IEEE International Conference on Data Mining (ICDM), pp. 1156–1161. IEEE (2019)
- Kokoszka, P., Reimherr, M.: Introduction to Functional Data Analysis, 1st edn. Chapman and Hall/CRC (2017). 10.1201/9781315117416
- Lila, E., Aston, J.A., Sangalli, L.M.: Functional data analysis of neuroimaging signals associated with cerebral activity in the brain cortex. In Functional Statistics and Related Fields, pp. 169–172. Springer (2017)
- Liu, X., Wang, L.: Computing the maximum similarity bi-clusters of gene expression data. Bioinformatics 23(1), 50–56 (2007)
- Liu, X., Yang, M.C.: Simultaneous curve registration and clustering for functional data. Comput. Stat. Data Anal. 53(4), 1361–1376 (2009)
- Lonardi, J., Patel, P.: Finding motifs in time series. In Proc. of the 2nd Workshop on Temporal Data Mining, pp. 53–68 (2002)
- Madeira, S.C., Oliveira, A.L.: Biclustering algorithms for biological data analysis: a survey. IEEE/ACM Trans. Comput. Biol. Bioinf. 1(1), 24–45 (2004)
- Mueen, A., Keogh, E., Zhu, Q., Cash, S., Westover, B.: Exact discovery of time series motifs. In Proceedings of the 2009 SIAM International Conference on Data Mining, pp. 473–484. SIAM (2009)
- Murtagh, F., Contreras, P.: Algorithms for hierarchical clustering: an overview. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2(1), 86–97 (2012)
- Pontes, B., Giráldez, R., Aguilar-Ruiz, J.S.: Biclustering on expression data: a review. J. Biomed. Inform. 57, 163–180 (2015)
- Ramsay, J., Silverman, B.W.: Functional data analysis. Springer (2005)
- Sangalli, L.M., Secchi, P., Vantini, S., Vitelli, V.: K-mean alignment for curve clustering. Comput. Stat. Data Anal. 54(5), 1219–1233 (2010)
- Vitelli, V.: A novel framework for joint sparse clustering and alignment of functional data. J. Nonparametr. Stat. 36(1), 182–211 (2024)
- Wand, M., Ripley, B.: KernSmooth: Functions for kernel smoothing for Wand & Jones (1995). R package version 2.22-19 (2006)
- Yang, J., Wang, H., Wang, W., Yu, P.S.: An improved biclustering method for analyzing gene expression profiles. Int. J. Artif. Intell. Tools 14(05), 771–789 (2005)
- Yeh, C.C.M., Zhu, Y., Ulanova, L., Begum, N., Ding, Y., Dau, H.A., Silva, D.F., Mueen, A., Keogh, E.: Matrix profile i: all pairs similarity joins for time series: a unifying view that includes motifs, discords and shapelets. In 2016 IEEE 16th International Conference on Data Mining (ICDM), pp. 1317–1322. IEEE (2016)
- Zhu, Y., Yeh, C.C.M., Zimmerman, Z., Kamgar, K., Keogh, E.: Matrix profile xi: Scrimp++: time series motif discovery at interactive speeds. In 2018 IEEE International Conference on Data Mining (ICDM), pp. 837–846. IEEE (2018)