Abstract
Tongue motion during speech and swallowing involves synergies of locally deforming regions, or functional units. Motion clustering during tongue motion can be used to reveal the tongue’s intrinsic functional organization. A novel matrix factorization and clustering method for tissues tracked using tagged magnetic resonance imaging (tMRI) is presented. Functional units are estimated using a graph-regularized sparse non-negative matrix factorization framework, learning latent building blocks and the corresponding weighting map from motion features derived from tissue displacements. Spectral clustering using the weighting map is then performed to determine the coherent regions—i.e., functional units—defined by the tongue motion. Two-dimensional image data is used to verify that the proposed algorithm clusters the different types of images accurately. Three-dimensional tMRI data from five subjects carrying out simple non-speech/speech tasks are analyzed to show how the proposed approach defines a subject/task-specific functional parcellation of the tongue in localized regions.
1 Introduction
The relationship between the structural and functional components of the tongue is poorly understood partly due to the complex tongue anatomy and muscle interactions. The human tongue is a volume preserving structure with highly complex, orthogonally oriented, and interdigitated muscles. The tongue muscles interact with one another in order to carry out the oromotor behaviors of speaking, swallowing, and breathing, which are executed by deforming local functional units in this complex muscular array. Tongue motions are synergies created by locally deforming regions, or functional units [1]. Functional units are regions of the tongue that exhibit homogeneous motion during the execution of the specific task. Therefore, identifying functional units and understanding the mechanisms of coupling among them can identify motor control strategy in both normal and adapted speech (e.g., tongue motion after tongue cancer surgery).
To understand the function of the tongue, magnetic resonance imaging (MRI) has played a pivotal role in imaging both tongue surface motion using cine-MRI [2, 3] and internal tissue motion using tagged-MRI (tMRI) [4]. Despite the rich data on internal tissue motion that is available from tMRI, there has been very little research on its analysis to determine functional units. A key previous report is that of Stone et al. [5], who presented a method to determine functional segments using ultrasound and tMRI. Another key report is that of Ramanarayanan et al. [6], who used a convolutive non-negative matrix factorization (NMF) algorithm to determine tongue movement primitives from electromagnetic articulography data. Our work is inspired by both of these approaches, but we use far richer tMRI-derived data (3D displacement fields) and the NMF approach with the addition of sparsity and intrinsic data geometry in defining functional units.
Modeling a data matrix as sparse linear combinations of basis vectors is a popular approach to understanding speech production. Among such methods, NMF and its variants involving sparsity have received substantial attention since the seminal work by Lee and Seung [7]. NMF is a matrix factorization method that focuses on data matrices whose elements are non-negative. NMF is based on a parts-based representation inspired by psychological and physiological observations about the human brain [8]. However, since standard NMF assumes a standard Euclidean distance measure for its data, it fails to discover the intrinsic geometry of its data [8].
In our work, we assume a manifold of the data within an NMF approach, which thereby captures the intrinsic geometry of the motion features derived from tMRI. In particular, we propose a new approach to determine functional units of tongue motion from tMRI using graph-regularized sparse NMF with spectral clustering. The method integrates a regularization term that encourages the computation of distances on a manifold rather than the whole of Euclidean space in order to preserve the intrinsic geometry of the observed motion data. The use of NMF is important because it does not allow negative combinations of basis vectors. This is consistent with the analysis of muscles, which either have positive activation or no activation, not negative activation. Both quantitative and qualitative evaluation results demonstrate the validity of the proposed method and its superiority to conventional clustering algorithms.
2 Proposed Approach
2.1 Problem Statement
Consider a set of P internal tongue tissue points, each with n scalar quantities (e.g., magnitude and angle of each track), tracked through F time frames. These quantities characterize each point and are used to group the points into functional units. The location of the p-th tissue point at the f-th time frame can be written as $(x_f^p, y_f^p, z_f^p)$. The tongue motion can then be represented by a 3F×P spatio-temporal feature matrix $N = [n_1, \ldots, n_P] \in \mathbb{R}^{3F \times P}$, where the p-th column is given by
$$n_p = [x_1^p, \ldots, x_F^p,\; y_1^p, \ldots, y_F^p,\; z_1^p, \ldots, z_F^p]^T. \tag{1}$$
We cast the problem of determining the functional units as a motion clustering problem. Thus, the goal is to determine a permutation of the columns to form [N1| N2| … |Nc], where the submatrix Ni comprises point tracks associated with the i-th submotion, i.e., the i-th functional unit. The proposed approach is described in more detail in the following subsections. The overall algorithm is shown below.

2.2 Extraction of Motion Quantities
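The first step in our algorithm is to extract the motion features that characterize the cohesive motion patterns over time. We extract motion features including the magnitude and angle of the track as in [9], defined for f = 1, …, F − 1 as
$$m_f^p = \sqrt{(x_{f+1}^p - x_f^p)^2 + (y_{f+1}^p - y_f^p)^2 + (z_{f+1}^p - z_f^p)^2}, \tag{2}$$
$$\mathit{cz}_f^p = \frac{x_{f+1}^p - x_f^p}{\sqrt{(x_{f+1}^p - x_f^p)^2 + (y_{f+1}^p - y_f^p)^2}} + 1, \tag{3}$$
$$\mathit{cx}_f^p = \frac{y_{f+1}^p - y_f^p}{\sqrt{(y_{f+1}^p - y_f^p)^2 + (z_{f+1}^p - z_f^p)^2}} + 1, \tag{4}$$
$$\mathit{cy}_f^p = \frac{z_{f+1}^p - z_f^p}{\sqrt{(z_{f+1}^p - z_f^p)^2 + (x_{f+1}^p - x_f^p)^2}} + 1, \tag{5}$$
where $m_f^p$ denotes the magnitude of the track and $\mathit{cz}_f^p$, $\mathit{cx}_f^p$, and $\mathit{cy}_f^p$ represent the cosine of the angle projected in the z, x, and y axes plus one, respectively, which are in the range of 0 to 2. For clustering, we gather all the motion features into a 4(F − 1) × P non-negative matrix $U = [u_1, \ldots, u_P] \in \mathbb{R}_+^{4(F-1) \times P}$, where the p-th column can be expressed as
$$u_p = [m_1^p, \ldots, m_{F-1}^p,\; \mathit{cz}_1^p, \ldots, \mathit{cz}_{F-1}^p,\; \mathit{cx}_1^p, \ldots, \mathit{cx}_{F-1}^p,\; \mathit{cy}_1^p, \ldots, \mathit{cy}_{F-1}^p]^T. \tag{6}$$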
These features are always non-negative and can therefore be input to NMF.
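As a rough illustration of this feature extraction step, the sketch below computes the track magnitude and the three shifted cosines from frame-to-frame displacements with NumPy. The array layout `(F, P, 3)`, the function name `motion_features`, the pairing of each cosine with a coordinate plane, and the small `eps` guard against division by zero are assumptions made for the example, not details taken from the original implementation.

```python
import numpy as np

def motion_features(tracks, eps=1e-12):
    """Build a non-negative feature matrix from point tracks.

    tracks: array of shape (F, P, 3) holding x, y, z positions (assumed layout).
    Returns an array of shape (4 * (F - 1), P).
    """
    d = np.diff(tracks, axis=0)                  # frame-to-frame displacement, (F-1, P, 3)
    dx, dy, dz = d[..., 0], d[..., 1], d[..., 2]
    mag = np.sqrt(dx**2 + dy**2 + dz**2)         # magnitude of the track
    # Cosines of the projected angles, shifted by one so they lie in [0, 2].
    cz = dx / (np.sqrt(dx**2 + dy**2) + eps) + 1.0
    cx = dy / (np.sqrt(dy**2 + dz**2) + eps) + 1.0
    cy = dz / (np.sqrt(dz**2 + dx**2) + eps) + 1.0
    return np.vstack([mag, cz, cx, cy])          # stack the four feature groups row-wise

# Synthetic tracks: 26 frames, 50 points (random walks, for illustration only).
rng = np.random.default_rng(0)
tracks = rng.normal(size=(26, 50, 3)).cumsum(axis=0)
U = motion_features(tracks)
print(U.shape, U.min() >= 0.0)
```

Each column of `U` describes one tissue point, so the matrix can be passed directly to a non-negative factorization.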
2.3 Graph-regularized Sparse Non-negative Matrix Factorization
NMF
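Given a non-negative data matrix $U \in \mathbb{R}_+^{m \times n}$ and k ≤ min(m,n), let $V \in \mathbb{R}_+^{m \times k}$ be the building blocks and let $W \in \mathbb{R}_+^{k \times n}$ be the weighting map. The goal of NMF is to learn building blocks and corresponding weights such that the input U is approximated by a product of two non-negative matrices (i.e., U ≈ VW). A typical way to define NMF is to use the Frobenius norm to measure the difference between U and VW [7] given by
$$\mathcal{E}(V, W) = \frac{1}{2}\|U - VW\|_F^2, \tag{7}$$
where ‖·‖F denotes the matrix Frobenius norm. The solution can be found through the multiplicative update rule [7]:
$$V \leftarrow V \odot \frac{U W^T}{V W W^T}, \tag{8}$$
$$W \leftarrow W \odot \frac{V^T U}{V^T V W}, \tag{9}$$
where ⊙ and the fraction bar denote element-wise multiplication and division, respectively.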
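The multiplicative updates of standard NMF can be sketched in a few lines of NumPy. This is a minimal illustration rather than the code used in this work: the function name `nmf`, the random initialization, the iteration count, and the `eps` term added to the denominators for numerical safety are all assumptions.

```python
import numpy as np

def nmf(U, k, n_iter=200, eps=1e-9, seed=0):
    """Lee-Seung multiplicative updates for U ~= V @ W (Frobenius objective)."""
    rng = np.random.default_rng(seed)
    m, n = U.shape
    V = rng.random((m, k)) + eps      # building blocks, shape (m, k)
    W = rng.random((k, n)) + eps      # weighting map, shape (k, n)
    for _ in range(n_iter):
        V *= (U @ W.T) / (V @ W @ W.T + eps)   # element-wise update of V
        W *= (V.T @ U) / (V.T @ V @ W + eps)   # element-wise update of W
    return V, W

# Synthetic rank-3 non-negative data, for illustration only.
rng = np.random.default_rng(1)
U = rng.random((20, 3)) @ rng.random((3, 30))
V1, W1 = nmf(U, k=3, n_iter=1)
V, W = nmf(U, k=3, n_iter=200)
print(np.linalg.norm(U - V1 @ W1), np.linalg.norm(U - V @ W))
```

Because every factor in the updates is non-negative, V and W stay non-negative throughout, which is what makes the parts-based interpretation possible.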
Sparsity Constraint
In this work, we impose a sparsity constraint on the weighting map W. The sparsity constraint makes it possible to encode the high-dimensional motion data using only a few active components, thereby making the weighting map easy to interpret. In particular, the weighting map obtained this way will represent the simplest tongue behavior that could generate the observed motion. In the NMF framework, it has been reported that a fractional regularizer using the L1/2 norm outperformed the L1 norm regularizer and gave sparser solutions [10]. Thus, we incorporate the L1/2 sparsity constraint into the NMF framework, which can be expressed as
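$$\mathcal{E}(V, W) = \frac{1}{2}\|U - VW\|_F^2 + \eta\|W\|_{1/2}, \tag{10}$$
where the parameter η ≥ 0 controls the sparseness of W and ‖W‖1/2 is defined as
$$\|W\|_{1/2} = \sum_{i=1}^{k}\sum_{j=1}^{n} w_{ij}^{1/2}. \tag{11}$$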
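A small numerical example may help show why the L1/2 regularizer favors sparse weighting maps. The two toy matrices below are hypothetical and share the same L1 norm, yet the one that spreads its mass over many entries receives a much larger L1/2 penalty.

```python
import numpy as np

def l_half(W):
    """Sum of square roots of the entries of a non-negative matrix."""
    return np.sqrt(W).sum()

dense = np.full((2, 4), 0.125)    # mass spread evenly over 8 entries
sparse = np.zeros((2, 4))
sparse[0, 0] = 1.0                # same total mass in a single entry

print(dense.sum(), sparse.sum())        # both L1 norms equal 1.0
print(l_half(dense), l_half(sparse))    # about 2.83 versus 1.0
```

An L1 penalty cannot distinguish these two matrices, whereas the L1/2 penalty prefers the concentrated one.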
Manifold Regularization
Many human motions lie on low-dimensional manifolds that are non-Euclidean [11]. NMF with the L1/2 norm sparsity constraint, however, produces a weighting map based on a Euclidean structure in the high-dimensional data space. Thus, the intrinsic and geometric relation between motion features may not be reflected accurately. To remedy this, we incorporate a manifold regularization that respects the intrinsic geometric structure as in [8,12,13]. The manifold regularization favors the local geometric structure and also serves as a smoothness operator, which reduces the interference of noise. Our final objective function incorporating both the manifold regularization and the sparsity constraint can then be given by
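$$\mathcal{E}(V, W) = \frac{1}{2}\|U - VW\|_F^2 + \frac{\lambda}{2}\mathrm{Tr}(W L W^T) + \eta\|W\|_{1/2}, \tag{12}$$
where λ is a balancing parameter of the manifold regularization, Tr(·) denotes the trace of a matrix, Q is a heat kernel weighting, D is a diagonal matrix where $D_{ii} = \sum_j Q_{ij}$, and L = D − Q is the graph Laplacian.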
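The graph quantities Q, D, and L can be built as in the following sketch. The text only states that Q is a heat kernel weighting, so the nearest-neighbor graph, the neighborhood size `n_neighbors`, the kernel width `t`, and the function name `heat_kernel_graph` are assumptions chosen for illustration, in the spirit of [8].

```python
import numpy as np

def heat_kernel_graph(U, n_neighbors=5, t=1.0):
    """Heat-kernel weights on a symmetrized k-nearest-neighbor graph.

    U: feature matrix with one column per data point.
    Returns (Q, D, L) with L = D - Q the graph Laplacian.
    """
    X = U.T                                        # one row per point
    sq = (X ** 2).sum(axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    np.fill_diagonal(d2, np.inf)                   # exclude self-neighbors
    n = X.shape[0]
    idx = np.argsort(d2, axis=1)[:, :n_neighbors]  # nearest neighbors of each point
    rows = np.repeat(np.arange(n), n_neighbors)
    cols = idx.ravel()
    Q = np.zeros((n, n))
    Q[rows, cols] = np.exp(-d2[rows, cols] / t)    # heat kernel on graph edges
    Q = np.maximum(Q, Q.T)                         # make the graph undirected
    D = np.diag(Q.sum(axis=1))                     # degree matrix
    return Q, D, D - Q

rng = np.random.default_rng(0)
U = rng.random((8, 30))                            # synthetic features
Q, D, L = heat_kernel_graph(U, n_neighbors=4)
print(np.allclose(L.sum(axis=1), 0.0))
```

Since L is symmetric positive semidefinite, the term Tr(W L Wᵀ) is never negative and becomes small when neighboring points receive similar weights.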
Minimization
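The objective function in Eq. (12) is not jointly convex in V and W, so a global minimum cannot be guaranteed. To minimize the objective function, we use a multiplicative iterative method similar to that used in [13]. Let Ψ = [ψmk] and Φ = [φkn] be the Lagrange multipliers for the constraints vmk ≥ 0 and wkn ≥ 0, respectively. By using the definition of the Frobenius norm, ‖U‖F = (Tr(UTU))1/2, and matrix calculus, the Lagrangian ℒ is given by
$$\mathcal{L} = \frac{1}{2}\mathrm{Tr}(U U^T) - \mathrm{Tr}(U W^T V^T) + \frac{1}{2}\mathrm{Tr}(V W W^T V^T) + \frac{\lambda}{2}\mathrm{Tr}(W L W^T) + \eta\|W\|_{1/2} + \mathrm{Tr}(\Psi V^T) + \mathrm{Tr}(\Phi W^T). \tag{13}$$
The partial derivatives of ℒ with respect to V and W are
$$\frac{\partial \mathcal{L}}{\partial V} = -U W^T + V W W^T + \Psi, \qquad \frac{\partial \mathcal{L}}{\partial W} = -V^T U + V^T V W + \frac{\eta}{2} W^{-1/2} + \lambda W L + \Phi, \tag{14}$$
where $W^{-1/2}$ is taken element-wise. By setting these derivatives to zero and using the Karush-Kuhn-Tucker conditions, i.e., ψmk vmk = 0 and φkn wkn = 0, together with L = D − Q, the final update rules become
$$V \leftarrow V \odot \frac{U W^T}{V W W^T}, \qquad W \leftarrow W \odot \frac{V^T U + \lambda W Q}{V^T V W + \frac{\eta}{2} W^{-1/2} + \lambda W D}, \tag{15}$$
where ⊙ and the fraction bar denote element-wise multiplication and division, respectively.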
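The update rules can be sketched as follows. This is a minimal illustration under stated assumptions, not the original implementation: the function name `gs_nmf`, the default values of `lam` and `eta`, the fixed iteration count, the `eps` safeguards, and the toy chain graph used for Q are all hypothetical.

```python
import numpy as np

def gs_nmf(U, Q, k, lam=0.01, eta=0.01, n_iter=500, eps=1e-9, seed=0):
    """Graph-regularized sparse NMF via multiplicative updates (sketch)."""
    rng = np.random.default_rng(seed)
    m, n = U.shape
    D = np.diag(Q.sum(axis=1))                 # degree matrix of the graph
    V = rng.random((m, k)) + 0.1
    W = rng.random((k, n)) + 0.1
    for _ in range(n_iter):
        V *= (U @ W.T) / (V @ W @ W.T + eps)
        sparsity = 0.5 * eta * (W + eps) ** -0.5   # gradient of the L1/2 term
        W *= (V.T @ U + lam * W @ Q) / (V.T @ V @ W + sparsity + lam * W @ D + eps)
    return V, W

# Toy data: 30 points in two groups of 15, with a chain graph linking neighbors.
rng = np.random.default_rng(1)
V_true = rng.random((20, 2))
W_true = np.hstack([np.tile([[1.0], [0.1]], 15), np.tile([[0.1], [1.0]], 15)])
U = V_true @ W_true
Q = np.zeros((30, 30))
i = np.arange(29)
Q[i, i + 1] = Q[i + 1, i] = 1.0

V, W = gs_nmf(U, Q, k=2)
rel_err = np.linalg.norm(U - V @ W) / np.linalg.norm(U)
print(V.shape, W.shape, rel_err)
```

Splitting L into D and Q keeps every term in the numerator and denominator non-negative, so the factors cannot change sign during the iterations.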
2.4 Spectral Clustering
The non-negative weighting map provides a good measure of regional tissue point similarity. Thus spectral clustering using the weighting map is adopted to determine the cohesive motion patterns as spectral clustering outperforms traditional clustering algorithms such as the K-means algorithm [14].
Once W is determined from Eq. (15), an affinity matrix A is constructed:
| (16) |
where w(i) is the i-th column vector of W and σ denotes the scale (we set σ = 0.02 in this work). The column vectors of W form the nodes of the graph, and the similarity A computed between column vectors of W forms the edge weights. On the affinity matrix, we apply a spectral clustering technique using a normalized cut algorithm [15].
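The clustering step can be illustrated with a simplified two-way normalized cut. The sketch below makes several assumptions that are not taken from the text: a Gaussian affinity of the form exp(−‖w(i) − w(j)‖²/σ), a zero diagonal, a split at zero of the second generalized eigenvector, and a value of σ chosen to suit the toy data rather than the 0.02 used in this work. More than two clusters would require recursive cuts or a clustering of several eigenvectors.

```python
import numpy as np

def gaussian_affinity(W, sigma):
    """Affinity between columns of W (assumed Gaussian form, zero diagonal)."""
    X = W.T
    sq = (X ** 2).sum(axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    A = np.exp(-d2 / sigma)
    np.fill_diagonal(A, 0.0)
    return A

def two_way_ncut(A):
    """Split the graph in two using the second eigenvector of the normalized Laplacian."""
    d = A.sum(axis=1)
    d_isqrt = 1.0 / np.sqrt(d)
    L_sym = np.eye(len(d)) - d_isqrt[:, None] * A * d_isqrt[None, :]
    _, vecs = np.linalg.eigh(L_sym)           # eigenvalues in ascending order
    y = d_isqrt * vecs[:, 1]                  # generalized eigenvector of (D - A, D)
    return (y > 0).astype(int)

# Toy weighting map: 10 columns near (1, 0) and 10 columns near (0, 1).
rng = np.random.default_rng(0)
centers = np.array([[1.0, 0.0], [0.0, 1.0]])
W = np.hstack([centers[[0]].T + 0.01 * rng.random((2, 10)),
               centers[[1]].T + 0.01 * rng.random((2, 10))])
labels = two_way_ncut(gaussian_affinity(W, sigma=1.0))
print(labels)
```

The label values themselves are arbitrary; only the grouping of columns matters.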
3 Experimental Results
3.1 Experiments Using 2D Data
Since there is no ground truth in our in vivo data, we used two 2D datasets to demonstrate the clustering performance of the proposed method. The first dataset is the COIL20 image library, which contains 20 classes (32×32 gray scale images of 20 objects). The second dataset is the CMU PIE face database, which has 68 classes (32×32 gray scale face images of 68 persons). In order to compare the performance of the different algorithms, we used a K-means clustering method (K-means), a normalized cut method (N-Cut) [15], standard NMF with K-means clustering (NMF-K), graph-regularized NMF with K-means clustering (G-NMF-K) [8], graph-regularized NMF with spectral clustering (G-NMF-S), graph-regularized sparse NMF with K-means clustering (GS-NMF-K), and our method (GS-NMF-S). Two metrics, the Normalized Mutual Information (NMI) and the accuracy (AC), were used to measure the clustering performance as used in [8]. Table 1 lists the NMI and AC values, demonstrating that the proposed method outperformed other methods. We also compared the L1/2 and L1 norms experimentally, and the L1/2 norm had slightly better results.
Table 1.
Clustering Performance: NMI and AC
| Metric | Dataset | K-means | N-Cut | NMF-K | G-NMF-K | GS-NMF-K | G-NMF-S | Ours |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NMI (%) | COIL20 (K=20) | 73.80 | 76.56 | 74.36 | 87.59 | 90.11 | 90.24 | 90.63 |
| NMI (%) | PIE (K=68) | 54.40 | 77.13 | 69.82 | 89.93 | 89.95 | 90.95 | 91.74 |
| AC (%) | COIL20 (K=20) | 60.48 | 66.52 | 66.73 | 72.22 | 83.75 | 84.58 | 85.00 |
| AC (%) | PIE (K=68) | 23.91 | 65.91 | 66.21 | 79.3 | 79.93 | 80.60 | 84.31 |
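For readers unfamiliar with the two metrics reported in Table 1, the sketch below shows one common way to compute them from ground-truth and predicted labels. It is an illustration under assumptions: NMI is normalized by the larger of the two entropies, following the definition in [8], and the best cluster-to-class mapping for AC is found by brute force over permutations, which is only practical for a handful of clusters (an assignment solver would be needed for K = 20 or 68). The function names are hypothetical.

```python
import numpy as np
from itertools import permutations

def contingency(y_true, y_pred):
    """Count matrix with classes as rows and clusters as columns."""
    classes, ti = np.unique(y_true, return_inverse=True)
    clusters, pi = np.unique(y_pred, return_inverse=True)
    C = np.zeros((len(classes), len(clusters)))
    np.add.at(C, (ti, pi), 1)
    return C

def accuracy(y_true, y_pred):
    """Fraction of points correctly labeled under the best one-to-one mapping."""
    C = contingency(y_true, y_pred)
    r, c = C.shape
    k = max(r, c)
    P = np.zeros((k, k))
    P[:r, :c] = C                              # pad to a square matrix
    best = max(sum(P[i, p[i]] for i in range(k)) for p in permutations(range(k)))
    return best / len(y_true)

def nmi(y_true, y_pred):
    """Mutual information divided by the larger of the two label entropies."""
    C = contingency(y_true, y_pred)
    pij = C / C.sum()
    pi, pj = pij.sum(axis=1), pij.sum(axis=0)
    nz = pij > 0
    mi = (pij[nz] * np.log(pij[nz] / np.outer(pi, pj)[nz])).sum()
    entropy = lambda p: -(p[p > 0] * np.log(p[p > 0])).sum()
    denom = max(entropy(pi), entropy(pj))
    return mi / denom if denom > 0 else 1.0

y_true = [0, 0, 0, 1, 1, 1]
y_pred = [1, 1, 0, 0, 0, 0]                    # one point ends up in the wrong cluster
print(accuracy(y_true, y_pred), nmi(y_true, y_pred))
```

Both metrics are invariant to how the cluster labels are numbered, which is why a mapping step is needed for AC.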
3.2 Experiments Using In Vivo Tongue Data
We also tested our method using a simple non-speech protrusion task and a speech task: “a souk”. Four subjects said “a souk” and one subject performed the protrusion task. All MRI scanning was performed on a Siemens 3.0 T Tim Trio system (Siemens Medical Solutions, Malvern, PA) with a 12-channel head coil and a 4-channel neck coil. The tMRI datasets were collected using Magnitude Image CSPAMM Reconstruction (MICSR) [16]. The datasets had a 1 second duration, 26 time frames with a temporal resolution of 36 ms for each phase with no delay from the tagging pulse, 6 mm thick slices (6 mm tag separation), and 1.875 mm in-plane resolution with no gap. The field-of-view was 24 cm. We acquired 2D orthogonal stacks of tMRI and used harmonic phase (HARP) to track internal tissue points. The incompressible deformation estimation algorithm was then used to combine the 2D tracking data into a 3D tracking result with an incompressibility constraint [17].
Fig. 1 shows the protrusion task. The outer tongue layer expands forward and upward (but not backward), and the region near the jaw has little motion. Functional units, based on magnitude and angle, have been extracted for two clusters (Fig. 1(b)) and three clusters (Fig. 1(c)). Fig. 1(b) is a good representation of forward protrusion (blue) vs. no motion (red), but the three-cluster output introduces noise, suggesting that there are only two clusters, or functional units.
Fig. 1.

Illustration of protrusion showing (a) 3D displacement field, (b) functional units (2 clusters), (c) functional units (3 clusters)
Fig. 2 shows the motion from /s/ to /u/ during the word “a souk”. The functional units were determined using our method for two clusters (Fig. 2(b)) and three clusters (Fig. 2(c)), respectively. Note that the three clusters better represent the motions of the tongue. These motions include backward motion of the tongue tip (blue), upward motion of the tongue body (green), and forward motion of the posterior tongue (red).
Fig. 2.

Illustration of motion from /s/ to /u/ showing (a) 3D displacement field, (b) functional units (2 clusters), (c) functional units (3 clusters)
4 Discussion and Conclusion
In this work, inspired by recent advances in sparse NMF and manifold learning, we presented a novel method for determining functional units from tMRI. Unlike previous algorithms, the proposed method aims to identify the internal, coherent manifold structure of high-dimensional motion data to determine functional units. The contributions of this work are two-fold. First, in an NMF framework, we formulate a new clustering problem, that of learning a latent weighting map followed by spectral clustering, and we give an efficient algorithm to solve it. Our method performed better than K-means, N-Cut, NMF-K, G-NMF-K, GS-NMF-K, and G-NMF-S using 2D data. Second, in a tongue motion analysis context, we define functional units from tMRI, which opens new avenues for studying speech production. The identified functional units were assessed visually, and further studies using biomechanical simulations are needed to co-validate our findings because of the lack of ground truth in in vivo data. The proposed method provides a principled way of defining subject/task-specific functional units, which could potentially be used to elucidate speech-related disorders.
References
- 1. Green JR. Tongue-surface movement patterns during speech and swallowing. The Journal of the Acoustical Society of America. 2003;113(5):2820–2833. doi: 10.1121/1.1562646.
- 2. Stone M, Davis EP, Douglas AS, Aiver MN, Gullapalli R, Levine WS, Lundberg AJ. Modeling tongue surface contours from cine-MRI images. Journal of Speech, Language, and Hearing Research. 2001;44(5):1026. doi: 10.1044/1092-4388(2001/081).
- 3. Bresch E, Kim YC, Nayak K, Byrd D, Narayanan S. Seeing speech: Capturing vocal tract shaping using real-time magnetic resonance imaging. IEEE Signal Processing Magazine. 2008;25(3):123–132.
- 4. Parthasarathy V, Prince JL, Stone M, Murano EZ, NessAiver M. Measuring tongue motion from tagged cine-MRI using harmonic phase (HARP) processing. The Journal of the Acoustical Society of America. 2007;121(1):491–504. doi: 10.1121/1.2363926.
- 5. Stone M, Epstein MA, Iskarous K. Functional segments in tongue movement. Clinical Linguistics & Phonetics. 2004;18(6-8):507–521. doi: 10.1080/02699200410003583.
- 6. Ramanarayanan V, Goldstein L, Narayanan SS. Spatio-temporal articulatory movement primitives during speech production: Extraction, interpretation, and validation. The Journal of the Acoustical Society of America. 2013;134(2):1378–1394. doi: 10.1121/1.4812765.
- 7. Lee DD, Seung HS. Learning the parts of objects by non-negative matrix factorization. Nature. 1999;401(6755):788–791. doi: 10.1038/44565.
- 8. Cai D, He X, Han J, Huang TS. Graph regularized nonnegative matrix factorization for data representation. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2011;33(8):1548–1560. doi: 10.1109/TPAMI.2010.231.
- 9. Cheriyadat AM, Radke RJ. Non-negative matrix factorization of partial track data for motion segmentation. IEEE 12th International Conference on Computer Vision. 2009. pp. 865–872.
- 10. Qian Y, Jia S, Zhou J, Robles-Kelly A. Hyperspectral unmixing via sparsity-constrained nonnegative matrix factorization. IEEE Transactions on Geoscience and Remote Sensing. 2011;49(11):4282–4297.
- 11. Elgammal A, Lee CS. The role of manifold learning in human motion analysis. Human Motion. 2008;25:56.
- 12. Yang S, Hou C, Zhang C, Wu Y. Robust non-negative matrix factorization via joint sparse and graph regularization for transfer learning. Neural Computing and Applications. 2013;23(2):541–559.
- 13. Lu X, Wu H, Yuan Y, Yan P, Li X. Manifold regularized sparse NMF for hyperspectral unmixing. IEEE Transactions on Geoscience and Remote Sensing. 2013;51(5):2815–2826.
- 14. Von Luxburg U. A tutorial on spectral clustering. Statistics and Computing. 2007;17(4):395–416.
- 15. Shi J, Malik J. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2000;22(8):888–905.
- 16. NessAiver M, Prince JL. Magnitude image CSPAMM reconstruction (MICSR). Magnetic Resonance in Medicine. 2003;50(2):331–342. doi: 10.1002/mrm.10523.
- 17. Xing F, Woo J, Murano EZ, Lee J, Stone M, Prince JL. 3D tongue motion from tagged and cine MR images. Medical Image Computing and Computer-Assisted Intervention. 2013:41–48. doi: 10.1007/978-3-642-40760-4_6.
