Abstract
Detecting how genes regulate biological shape has become a multidisciplinary research interest because of its wide application in many disciplines. Despite its fundamental importance, the challenges of accurately extracting information from an image, statistically modeling the high-dimensional shape and meticulously locating shape quantitative trait loci (QTL) affect the progress of this research. In this article, we propose a novel integrated framework that incorporates shape analysis, statistical curve modeling and genetic mapping to detect significant QTLs regulating variation of biological shape traits. After quantifying morphological shape via a radius centroid contour approach, each shape, as a phenotype, was characterized as a high-dimensional curve, varying as angle θ runs clockwise with the first point starting from angle zero. We then modeled the dynamic trajectories of three mean curves and variation patterns as functions of θ. Our framework led to the detection of a few significant QTLs regulating the variation of leaf shape collected from a natural population of poplar, Populus szechuanica var tibetica. This population, distributed at altitudes 2000–4500 m above sea level, is an evolutionarily important plant species. This is the first work in the quantitative genetic shape mapping area that emphasizes a sense of ‘function’ instead of decomposing the shape into a few discrete principal components, as the majority of shape studies do.
Keywords: morphological shape, quantitative shape genetic mapping, functional data analysis, Populus szechuanica var tibetica, leaf shape
Introduction
Tremendous variation in morphology has provided conspicuous evidence for environmental adaptation, biological evolution, natural selection and genetic involvement [1–3]. It has been reported that genes have an important role in controlling phenotypic variation in shape [4–6], and leaf shape has a high heritability [7–9]. Shape (also called form or morphology) is formally defined as the geometrical information that remains when location, scale and rotational effects are filtered out from an object [10]. Three main challenges in the quantitative genetic shape mapping research call for better approaches: (1) how to describe shape accurately, (2) how to model high-dimensional shape curves better without losing too much information and (3) how to detect shape quantitative trait loci (QTL) [7, 11–18].
In current literature, the majority of shape traits have been modeled by multivariate analysis. A few discrete landmark points are first extracted to describe a shape on a coarse scale, and then multivariate analysis approaches are implemented to model these discrete multiple points. Specifically, principal component analysis (PCA) has been the most popular approach in shape studies that projects the multidimensional space into a much lower dimension [7–9, 17, 19–31]; multivariate regression models have been used for analyzing allometry or evolutionary change in shape over time [22, 32]. Although these approaches achieved significant results for genetic shape mapping research, there still is room for improvement. A small amount of sparse landmark points neglect many subtle details of a shape, particularly irregular outlines. Arguments also exist regarding the functional interpretations of PCA [33]. Additionally, multivariate analysis does not use the dynamic trajectory of a trait and hence cannot capture the overall trends of a shape in a functional sense. As a continuous dynamic pattern can be observed underneath discrete landmark points [34], a shape curve is more biologically meaningful when treated as a function or trajectory [35] rather than multiple discrete points.
This article introduces an alternative approach that improves on existing quantitative genetic shape mapping research by describing shape information using high-dimensional curves and capturing smooth trajectories underlying shape curves using functional data analysis (FDA) strategies. While FDA is a relatively mature approach in the field of statistics [36–41], it has seldom been used in shape research. As far as we know, in current quantitative shape literature, there are only two articles implementing FDA to shape analysis [42, 43]. However, these two articles had dramatically different aims and foci than our article. The aforementioned articles simply studied shape curves with the user-friendly ‘FDA’ library without involving any genetic association. The genetic association that we considered here not only involved the shape trait (as the response) and genetic variables (including a hidden variable) but also doubled the difficulties of statistical modeling because no available package or model could accurately solve our problems.
This article was motivated by leaf shape data collected from a natural population of poplars, Populus Szechuanica var. tibetica. The radius centroid contour (RCC) approach extracted shape information accurately and worked well for any two-dimensional convex outlines. Fu et al. [28] described the details of how shape information was extracted from an image and how RCC transformed a shape into a high-dimensional curve [44, 45]. The main focus of this article is addressing the second and third aforementioned challenges of statistically modeling high-dimensional shape curves, and detecting significant QTLs regulating variation of shape curves. After the shape analysis step, each input image was output in the form of high-dimensional curve varying with the angle θ running clockwise. We modeled the dynamic trends of means and covariance structures of shape curves as functions of θ via nonparametric kernel regression and functional principal component analysis (FPCA) [38–41]. Unlike traditional parametric approaches that directly coped with high-dimensional covariance matrices, FPCA describes random trajectories in terms of eigenfunctions that are interpreted as modes of variation of shape curves. The framework proposed in this study was data adaptive, without assuming any prespecified parametric structures. Additionally, it incorporated shape analysis, FDA and a linkage disequilibrium (LD)-based QTL detection scheme into a novel framework to improve the methodology of quantitative shape mapping research, particularly when modeling high-dimensional shape curves. Besides shape, our approach is applicable to any other data set with curves as response and categorical variables as predictors.
Mixture FDA model
Let denote the RCC value (i.e. response in statistical terminology and phenotype in genetic terminology) of the ith subject at the jth angular degree. Here, Ni was the total number of angles measured for the ith subject. To make comparisons of different shapes meaningful, we set the number of measurements consistent for each leaf (i.e. Ni = 180 for each ). However, we kept the subject-related notation Ni to make it easy to extend to any irregular or sparse case. As the measurement scales were also allowed to vary for different subjects, we considered θ as a random variable, denoted as , lying in a compact interval . Then for each shape i, the original measurements were .
We denoted two alleles of a marker as M (with allele frequency p) and m (with allele frequency ) and two alleles of a QTL as A (with allele frequency q) and a (with allele frequency ). The two alleles of a QTL formed three genotypes, expressed as AA (coded as c = 3), Aa (coded as c = 2) and aa (coded as c = 1). The genotype of a marker was a categorical predictor, and the genotype of a QTL was a hidden variable. We assumed that the QTL and marker were associated through a LD parameter D. The marker and QTL formed four haplotypes MA, Ma, mA and ma, with the frequencies and , respectively. The haplotypes from maternal and paternal parents unite randomly to generate nine marker-QTL genotype combinations. The LD-based QTL mapping model detected the hidden QTL by way of the conditional probability for a QTL genotype c given an observable marker genotype for subject i, expressed as . For a natural population satisfying the Hardy–Weinberg equilibrium, was computed from a 3 × 3 contingency table in terms of p, q and D [46].
The basic statistical idea behind QTL mapping was a mixture model, in which each observed shape was assumed to be attributed to one of the three QTL genotypes, and each genotypic effect on phenotype was modeled from a separate density function (assuming that each quantitative shape curve follows a Gaussian process). Specifically, the effects of a QTL responsible for variation of shape traits could be modeled as [47, 48]
| (1) |
where ξic is an indicator variable describing a possible QTL genotype c for subject i. That is, ξic was coded as 1 if the corresponding QTL genotype c had an effect on ith shape, and 0 otherwise. is the underlying stochastic process of defined in for a particular genotype (c = 1, 2, 3). The trajectories of are also smooth functions of θ. is the experimental error assumed to be independent and identically distributed as .
For a fixed genotype c, we denoted the mean function of as and the covariance function of it as . The mean function was interpreted as the cth genotypic effect of a QTL on the shape trait. Throughout this article, it was assumed that was a smooth function for , and was a positive definite and bivariate smooth function for . Being ‘smooth’ means that the function is twice continuously differentiable. The idea of model (1) was that the originally observed messy shape curves could be modeled as a mixture of three smooth functions accounting for genotypic effects and additive noise.
Nonparametric functional PCA model
The main idea of FPCA is to interpret as the kernel of a linear mapping on the space of square-integrable functions , mapping to , defined by Hall et al. [49] as
An eigenfunction of the operator AG is a solution of the equation , with corresponding eigenvalue λ. For a fixed c, we assumed that the operator AG had a sequence of smooth orthonormal eigenfunctions satisfying (here, δklc was the Kronecker symbol), with ordered eigenvalues . By Mercer’s theorem [50], a spectral decomposition on the function , Hilbert–Schmidt kernel, yielded
| (2) |
Applying the generalized Fourier expansion (Karhunen–Loeve theorem proposed by Karhunen [51]), the stochastic process can be decomposed into
| (3) |
where the sum is defined in the sense of L2 convergence, with uniform convergence, and
| (4) |
ζilc are uncorrelated random variables with , and , subject to L2 convergence, i.e. . ζilc is frequently referred to as the lth functional principal component scores or the lth dominant modes of variation structure.
Combining Equations (1) and (3), we obtained
| (5) |
Parameter estimate
From the original observed data set
we obtained estimators for all unknown parameters and contained in model (5). If the estimator was obtained, we could compute a rough estimator of covariance matrix by
The local linear smoother estimator for was obtained by minimizing
| (6) |
with respect to [40, 41, 47]. is the bivariate nonnegative compactly supported kernel function used as weights for locally weighted least squares smoothing [47, 52]. In this article, we used the Epanechnikov function as the kernel functions [53]. Also, hG is the bandwidth corresponding to the kernel function K2. The minimization of Equation (6) yielded [41], which could be solved in a closed form [47] as follows:
| (7) |
Once the smoothed covariance function was computed from Equation (7), it was then discretized on a suitable finite grid and represented as a covariance matrix.
The estimators of the eigenfunctions were obtained by a corresponding spectral decomposition on [54]. To be more specific, are the eigenvalues of , given by
Furthermore, the are the eigenfunctions corresponding to , satisfying and if . Here, also agreed with an empirical version of the expansion (2)
| (8) |
where I is the indicator function used to keep only the terms with positive eigenvalues. The positive definiteness of the estimated covariance matrix is not always guaranteed, which might cause problems in practical applications. To avoid potential problems, Yao et al. [38] proposed a solution by only keeping positive eigenvalues [34]. If some were negative, then we dropped these negative eigenvalues and their corresponding eigenfunctions, and reconstituted the estimate from the remaining eigenvalue and eigenfunction estimates until only positive eigenvalues were retained.
Once and were obtained, fitting trajectories required the estimation of functional principal component scores. By discretizing on Equation (4), plugging and into a Riemann sum approximation of the integral, and setting , we obtained
| (9) |
[34]. This approximation by sum worked well for the regular case. If the data were sparse or irregular, another approach called Principal Analysis through Conditional Expectation could be considered [39].
We then estimated the mean function by removing the covariance portion to explore the remaining terms. We defined
| (10) |
Then, Equation (5) yielded
| (11) |
The only unknown parameters in Equation (11) were and . We integrated the Expectation–maximization (EM) algorithm and local polynomial kernel smoothing to estimate these parameters.
Let be m grid points at which the values of mean functions will be estimated. For any , we approximated from for θj located within a bandwidth hu of ωa, and . The unknown parameters in Equation (11) could be estimated by maximizing the following local log likelihood
| (12) |
Here, is a nonnegative univariate compactly supported kernel function that was used as weights for local polynomial smoothing. The EM algorithm based on was derived in the following steps. The E step was designed to calculate the posterior probability that subject i had a QTL genotype c given the marker genotype and phenotype data via
Using these posterior probabilities, the M step was derived to estimate the haplotype frequencies [55] and all other unknown parameters via
| (13) |
where , and the observation numbers of three marker genotypes were denoted as N1 for MM, N2 for Mm and N3 for mm.
We summarized the estimating procedures in the following steps:
Set initial values for all unknown parameters. Computed the initial value of from the observed marker genotype. As QTL was not observable and q was an important parameter, we extensively searched the initial values of , scanning from 0.1 to 0.9 with a scale of 0.1. The optimal initial value of was determined by the maximum value of the log likelihood (12). Doing so allowed us to avoid the sensitivity of initial values and improved the accuracy of the estimator. The iteration is ready to start after selecting initial values for and . The initial values of and are derived from the initial values of and .
Computed a rough estimator of covariance matrix .
Estimated smooth covariance function by bivariate local linear smoothing Equations (6 and 7).
Computed eigenfunctions and eigenvalues of discretized . Only kept the terms with positive eigenvalues.
Used Equation (9) to compute functional principal component scores .
Computed from Equation (10).
Applied EM algorithm to update and . Here, and could be derived from and via and .
Repeated step (2) through step (7) until all estimators converged.
Tuning parameter selection
The number of eigenfunctions used to approximate infinite-dimensional longitudinal process, as well as degree of smoothness, was two tuning parameters that critically affected the performance of the model. Degree of smoothness was determined by the number of grid points and bandwidths hu and hG. The density of the grid determines the accuracy of the analysis. In other words, the larger the number of grid points, the more accurate the estimation will be. Therefore, we set (100 points) as a ‘dense enough’ choice of the grid. Additionally, following the empirical suggestions of Huang et al. [47] and Wang et al. [48], we chose the number of eigenfunctions required to explain at least 80% of the total variation. The tuning parameters that needed to be selected were bandwidth hu and bandwidth hG.
One-subject-leave-out cross-validation has been a popular approach to select tuning parameters. However, because of the large number of computations because of multiple markers, several initial scans of q and bandwidth selection, this would be computationally expensive. Therefore, we used Akaike information criterion (AIC) instead, as Yao et al. [39] used it to obtain similar results in a more computationally efficient manner.
The AIC has the form , where L is the maximum value of the log-likelihood function (12), and df is the effective degrees of freedom that measure the complexity of the model. As models (5) and (12) involve , ζilc, and local polynomial kernel smoothing based on grid points, it is not trivial to determine the degrees of freedom. Inspired by Fan et al. [56] and Wang et al. [48], we adjusted their degrees of freedom to define the shape models presented in this article as . For the one-dimensional mean functions, the effective degree of freedom can be computed from [48, 56]
where is the support of θ, , and
Similarly, the effective degree of freedom for the two-dimensional covariance function can be defined as [48, 56]
Hypothesis tests
The effects of a QTL on shape variation were mainly manifested by three smooth mean functions and in Equation (5). To draw a scientific conclusion about whether a QTL was significantly associated with shape variation, a hypothesis test was needed:
| (14) |
H0 corresponded to the reduced model, in which only a single mean function was involved, which indicated that a QTL had no controls for the phenotypic variation. H1 corresponded to the full model, in which three mean functions were necessary for three different genotypes. The test statistic was calculated as the log-likelihood ratio of the reduced to full model.
In addition to the above hypothesis test related to the QTL’s existence, we also tested the strength of LD between the QTL and the marker. The basic assumption behind LD-based QTL model was the existence of LD between the hidden QTL and the observed marker. The significance of LD could be tested by
| (15) |
where H0 corresponded to the case that a marker and a QTL were at linkage equilibrium (i.e. no correlation). The test statistic for this hypothesis test was
which was reported to follow a distribution with one degree of freedom for a univariate trait [28, 57]. However, for the high-dimensional functional shape trait considered in this article, the assumed distribution may not hold.
A significant shape-related QTL is not detected unless tests (14) and (15) are simultaneously rejected for each marker, where (14) tests for an association between QTL and shape, and (15) tests for LD between the observable marker and the detected (yet unobservable) QTL. Rejecting both tests simultaneously indicates that the QTL exists and is located around the observed marker, as opposed to the QTL being both randomly segregated and located far away from the marker. In this context, the intersection union test (IUT) approach should be used. However, as the distributions defined for a univariate simple trait may not hold for a high-dimensional complex trait, we permuted 1000 times to determine the significance for these two tests [58].
Real leaf shape data analysis
We worked on a real leaf data set collected from poplars, Populus Szechuanica var. tibetica. A sample of n = 106 independent trees was randomly selected from a natural population distributed throughout the Tibet plateau. Twenty-nine microsatellite markers (16 primers) were genotyped. Meanwhile, the leaf images were collected by photographing one randomly picked representative leaf per tree. Each image was then processed using the shape analysis skills and quantitatively represented by a RCC curve of 180 dimensions. Figure 1 illustrates the detailed object and background separation, contour extraction and RCC transformation procedures [28].
Figure 1.
Diagrammatic illustration of shape analysis procedure. (A) Original photo with leaf object put on top of a cloth; (B) segmentation with object represented as white and background represented as black; (C) extracted contour of leaf from the background and removed leaf stem; and (D) RCC curve by computing the radius between centroid and contour points.
We chose K = 4 as the optimal number of eigenfunctions because they explained 83.7% of the total variation for all leaves observed, and both the magnitudes of eigenvalues and the percentage of variation explained dropped off dramatically after the fourth order and the changes resulting from remaining eigenfunctions were trivial. The choice of grid points was . Moreover, we chose the optimal bandwidths as and after scanning 25 candidates via AIC from the set of and . Figure 2 illustrates the AIC values of all 25 combinations for these two tuning parameters.
Figure 2.
AIC values for 25 combinations of fine scanning on bandwidths hu and hG.
We performed the proposed method for each of the candidate markers. The markers determined significant by permutation are GCPM_1063 with , P-value of 0.004 for the QTL test and P-value of 0.004 for the LD test; and GCPM_1034-1 with , P-value of 0.003 for the QTL test and P-value of 0.004 for the LD test. Note that the number of permutations affects the P-values, and the reported P-values mean that only three or four among 1000 permuted test statistic values are more extreme than the observed test statistic values on real data. We demonstrate the result for the most significant finding GCPM_1063 in the following figures and paragraphs.
The estimated smoothing mean functions for three genotypes of the QTL detected by marker GCPM_1063 are demonstrated in Figure 3A. The light blue curves were the RCC values of the originally observed noisy leaf shapes, from which we easily noticed that underlying dynamic patterns existed. As θ increased clockwise from 0 to (), three mean curves , c = 1, 2, 3, corresponding to three different genotypic effects, showed different functional trajectories. The biggest variations were observed at and where genotype aa (in black) had the highest peak. As the leaf blade was roughly bilateral symmetric, we expected to observe symmetric patterns over the left and right side of and , respectively. Figure 3B was the transformed version of Figure 3A from polar coordinates to xy coordinates. We computed xy coordinates by and , with . The polar coordinates demonstrated the results of FDA modeling better, and the xy coordinate demonstrated the structures of shape better. The light blue circles were original noisy leaf shapes. The three colored mean contours overlaying on top of the observed leaf contours illustrated the three genotypic effects of a QTL regulating leaf shape variation detected by marker GCPM_1063. The genotype aa (in black) showed a narrow and elongated pattern (an elliptical leaf shape), Aa (in orange) showed a slightly wider and shorter pattern (an ovate leaf shape) and AA (in red) showed the shortest and widest pattern (a cordate leaf shape). The most striking shape variation regulated by the three genotypic effects of the QTL detected by marker GCPM_1063 could be observed near the leaf base, tip and length/width ratio.
Figure 3.
(A) Estimated mean functions of RCC curves for three genotypes of the QTL detected by marker GCPM_1063. AA in red, Aa in orange and aa in black. The light blue curves represent all original leaf shapes in the data sets. (B) Transformed from polar coordinate in Figure 3A to xy coordinate, i.e. from vector space to image domain. AA in red, Aa in orange and aa in black.
Four eigenfunctions , shown in Figure 4, were used to approximate the infinite-dimensional decomposition of the estimated smooth covariance function in Equation (8). These eigenfunctions considered not only between subject variation but also within subject variation. The first eigenfunction had trends similar to the mean function, which indicated that the largest variation was achieved near and . The three different genotypes of the QTL detected by marker GCPM_1063 explained different percentages of total variation. The first eigenfunction for AA (in red) explained 44.66% percentage of total variation, for Aa (in orange) explained 62.96% percentage of total variation and for aa (in black) explained 53.86% of total variation. We also noticed that the first eigenfunction had negative values over the intervals , and , for all c = 1, 2, 3. It indicated that a RCC curve with a positive (or negative) functional principal component score along the direction of tends to have smaller (or larger) values in these intervals of θ than the overall population average. Unlike the first eigenfunction, the remaining three eigenfunctions explained small percentages of total variation from different directions. The largest absolute magnitudes of the second eigenfunction for the three genotypes happened around . The third eigenfunction divided variation into smaller segments and varied every segment. The orange line of the fourth eigenfunction had negative values around the interval , whereas red and black lines had positive signs on the same interval. Different signs indicated that the variation of genotype Aa changed in the opposite direction as those of aa and AA. In summary, the four eigenfunctions of genotype AA (in red) explained 86% of total variation, Aa (in orange) 94% and aa (in black) 90%, respectively.
Figure 4.
Four estimated eigenfunctions of RCC curves for three genotypes of the QTL detected by marker GCPM_1063. AA in red, Aa in orange and aa in black.
Discussion
In this article, we integrated a shape feature extraction skill, a mixture FDA model, a LD-based QTL detection scheme and an IUT hypothesis testing approach into one framework to detect significant QTLs regulating shape variation. After quantifying shapes by the RCC approach, each image was in the format of a high-dimensional curve. Then, we used a local polynomial kernel smoothing approach to model the smooth mean function and FPCA approach to model the smooth covariance function [38–41]. The original shape curves were modeled as samples of one of the three mean functions corresponding to three genotypic effects of a QTL plus additive noise. The three genotypic effects were estimated by incorporating the EM algorithm into a mixture FDA model. Besides the white noise caused by experimental error, the sum of infinite-dimensional smooth eigenfunctions explained between- and within-subject variation [41]. Finally, the equality of mean functions for three genotypes was tested on each candidate marker to determine the significance of the effect of a QTL, and the LD parameter was tested to evaluate the strength of the LD between the detected QTL and each candidate marker. The IUT and permutation approaches were used to guarantee the simultaneous rejection of these two tests before a significant conclusion could be made. We used MATLAB to computationally implement the modeling proposed in this article. The code is available per e-mail request.
We worked on a poplar leaf shape data set and detected two significant markers, GCPM_1063 and GCPM_1034-1. We observed great disparities in both mean and variance structures among three genotypes of the QTL in Figures 3 and 4, which visually confirmed the existence of the QTL. Additionally, the output (both mean and variance structures) of this approach is in the form of high-dimensional curves or functions with corresponding angles. This output makes it possible to demonstrate the association between shape changes and QTL, as well as variations in the nature and strength of the QTL effect.
This data set only has limited biallelic markers. These markers neither provide full genome coverage nor the concomitant ability to pinpoint specific loci affecting the trait. Therefore, this article was not intended for making any report about important biological or genetic findings and does not restrict the number of candidate genes, gene names or locations. Despite the data set limitations, it represents the only data currently available for poplar species. Additionally, a data set simultaneously containing leaf shape photos (phenotype) and a large-scale genetic markers is indeed rare for all species. The proposed methods are applicable to other shape data with a higher density of genetic markers. If parallel could be incorporated, it would also work for the GWAS shape data (with a high-dimensional phenotype in curves, as well as a high-dimensional genotype data covering the entire genome).
We overcame two challenges in current quantitative genetic shape mapping field. First, we accurately described shape quantitatively without much loss of information. Second, we efficiently modeled high-dimensional shape curves. The majority of the current shape studies have used PCA to decrease the dimension of the high-dimensional shape curves and have extracted shape feature from principal component scores as the modeling target [7–9, 17, 19–31]. However, the limitations of PCA have been noticed and reported in recent literature [33]. Specifically, the modeling of PCs (especially those PCs explaining a small amount of total variation) is sensitive to outliers, neglects the overall dynamic trends of shape, skips the intercorrelations among different shape components, yields spurious significance associations and produces false discovery genetic findings [59]. As a solution, we provided an alternative approach so that the entire shape curve (instead of the isolating PCs) could be modeled functionally. The isolating PC approach only uses partial information and therefore is less accurate and harder to interpret. Additionally, the mixture functional framework handles the heterogeneity common in genetic data [60, 61], and serves the data-driven demands better than traditional FDA.
Shape analysis has been widely studied in a large span of disciplines, including agriculture [62], plant genetics [8], ecology and evolution [63], developmental biology [18, 64], public health [65] and so on. Specifically, it has been used to analyze a variety of plant characteristics, including the root shape of Japanese radishes [66], as well as the leaves of tomatoes [29], fossils [67], grapevines [9, 31], temperate rainforest trees [68] and flowers [69], to name a few. Shape analysis has also been used to model several human health-related characteristics including bones [70], teeth mandibles [71], brains [72], hearts [65], human skulls [73] and many other traits. Leaf shape has been found to be correlated with important fruit attributes such as the sugar level of fruit [8], flavor and seedlessness of grapes [9]. The shape of plant anatomy (root, grain, vegetable, leaf and fruit) plays a crucial role in improving quantity and quality of agricultural products and crop improvement, which might be attributable to the indirect impact of leaf shape on photosynthesis or the shared pathways between plant organs [66]. Additionally, the shape of agricultural products is also important for marketing, grading and classifying products in terms of commercial quality [62]. Paleontologists use the size and shape of fossilized plants to reconstruct ancient climate [67]. The analyses of butterflies suggest a correlation between flight performance and wing shape [74]. The shape and contour of a person’s teeth and mandibles have large implications for their smile, self-expression and function in society [75]. Some human diseases are manifested in abnormal shape of organs. For example, the shape analysis of the human heart may help diagnose hypertrophic cardiomyopathy and hypertensive heart disease [65]. Shape analysis of the human skull has also been used to study the evolutionary process [73]. For these reasons, this article will contribute to the improvement in methodology in the quantitative shape mapping field and promote multidisciplinary collaborations.
The future applications of shape mapping in the aforementioned disciplines will extend the dimension of shape descriptors from two to three, incorporate multiple high-dimensional shapes per subject as the modeling target, study the development trajectories of shape (longitudinal shape) and consider more complex factors associated with the shape variation, such as the gene–environment interactions, gene–gene interactions, polygenic effects and so on. All of these future directions call for advanced statistical models. In addition to the general benefits of FDA models in shape mapping studies that are presented here, our framework could also handle other mixture Gaussian processes with a hidden variable and uncertain classifications, and associations between curve response (other than shape) and categorical predictors, including situations where the level of categorical predictors is unknown or is higher than three. Additionally, the proposed approach could be easily extended to sparse or irregular cases.
Key Points
Morphological shape is an important trait that responds to environmental adaptation, evolution, natural selection and genetic involvement. Shape has higher heritability than other univariate biological traits.
The majority of shape research has implemented PCA to reduce dimensionality and extracted the shape feature after a shape is described with a few discrete landmark points on a coarse scale.
FDA could be an alternative approach to directly model a high-dimensional curve after a shape is transformed into such a curve via a fine-scale shape analysis approach.
Acknowledgements
The authors thank Xiaotian Dai for making the figures, and thank Randall Reese and Brennan Bean for the English revisions.
Funding
The National Science Foundation (grant number DMS1413366 to G.F.); the National Institutes of Health (grant numbers U01 HL119178 and UL1 TR000127 to R.W.); and the National Science Foundation China (grant number 11301324 and Shanghai Chenguang Program to M.H.) and the Fundamental Research Funds for the Central Universities (grant number BLX2014-22 to B.W.).
Guifang Fu is an assistant professor of Statistics at Utah State University. Her research interests are shape analysis, statistical genetics and computational biology.
Mian Huang is an associate professor of Statistics at Shanghai University of Finance and Economics. His research interests are nonparametric modeling and functional data analysis.
Wenhao Bo is a Postdoc of plant genetics at Beijing Forestry University. His research interests are molecular genetics and plant genetics.
Han Hao is a graduate student of Statistics at Pennsylvania State University. Her research interests are statistical genetics and system biology.
Rongling Wu is a Changjiang Scholars Professor of Genetics at Beijing Forestry University and Distinguished Professor of Statistics at Pennsylvania State University. His research interests are shape analysis, statistical genetics, systems biology and computational biology.
References
- 1. Ricklefs RE, Miles DB.. Ecological and evolutionary inferences from morphology: an ecological perspective In: Ecological Morphology: Integrative Organismal Biology,Vol. 101 Chicago: University of Chicago Press, 1994, 13–41. [Google Scholar]
- 2. Reich PB. Body size, geometry, longevity and metabolism: do plant leaves behave like animal bodies? Trends Ecol Evol 2001;16(12):674–80. [Google Scholar]
- 3. Tsukaya H. Leaf shape: genetic controls and environmental factors. Int J Dev Biol 2005;49(5–6):547–55. [DOI] [PubMed] [Google Scholar]
- 4. Van der Knaap E, Lippman Z, Tanksley S.. Extremely elongated tomato fruit controlled by four quantitative trait loci with epistatic interactions. Theor Appl Genet 2002;104(2–3):241–7. [DOI] [PubMed] [Google Scholar]
- 5. Tanksley SD. The genetic, developmental, and molecular bases of fruit size and shape variation in tomato. Plant Cell 2004;16(suppl 1):S181–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6. Scarpella E, Barkoulas M, Tsiantis M.. Control of leaf and vein development by auxin. Cold Spring Harb Perspect Biol 2010;2(1):a001511.. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7. Langlade NB, Feng X, Dransfield T, et al. Evolution through genetically controlled allometry space. Proc Natl Acad Sci USA 2005;102(29):10221–6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8. Chitwood DH, Kumar R, Headland LR, et al. A quantitative genetic basis for leaf morphology in a set of precisely defined tomato introgression lines. Plant Cell 2013;25(7):2465–81. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9. Chitwood DH, Ranjan A, Martinez CC, et al. A modern ampelography: a genetic basis for leaf shape and venation patterning in grape. Plant Physiol 2014;164(1):259–72. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10. Kendall DG. Shape manifolds, procrustean metrics, and complex projective spaces. Bull Lond Math Soc 1984;16(2):81–121. [Google Scholar]
- 11. Cootes TF, Taylor CJ, Cooper DH, et al. Active shape models-their training and application. Comput Vis Image Underst 1995;61(1):38–59. [Google Scholar]
- 12. Dryden IL, Mardia KV.. Statistical Shape Analysis. J Wiley Chichester, 1998. [Google Scholar]
- 13. Abbasi S, Mokhtarian F, Kittler J.. Enhancing CSS-based shape retrieval for objects with shallow concavities. Image Vis Comput 2000;18(3):199–211. [Google Scholar]
- 14. Davies RH, Cootes TF, Taylor CJ.. A minimum description length approach to statistical shape modeling. Inf Process Med Imaging 2001;2082:50–63. [DOI] [PubMed] [Google Scholar]
- 15. Zhang J, Zhang X, Krim H, et al. Object representation and recognition in shape spaces. Pattern Recognit 2003;36(5):1143–54. [Google Scholar]
- 16. Super BJ. Fast correspondence-based system for shape retrieval. Pattern Recognit Lett 2004;25(2):217–25. [Google Scholar]
- 17. Renaud S, Auffray JC, De la Porte S.. Epigenetic effects on the mouse mandible: common features and discrepancies in remodeling due to muscular dystrophy and response to food consistency. BMC Evol Biol 2010;10(1):28.. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18. Klingenberg CP. Evolution and development of shape: integrating quantitative approaches. Nat Rev Genet 2010;11(9):623–35. [DOI] [PubMed] [Google Scholar]
- 19. Myers EM, Janzen FJ, Adams DC, et al. Quantitative genetics of plastron shape in slider turtles (Trachemys scripta). Evolution 2006;60(3):563–72. [PubMed] [Google Scholar]
- 20. Albert AY, Sawaya S, Vines TH, et al. The genetics of adaptive shape shift in stickleback: pleiotropy and effect size. Evolution 2008;62(1):76–85. [DOI] [PubMed] [Google Scholar]
- 21. Bensmihen S, Hanna AI, Langlade NB, et al. Mutational spaces for leaf shape and size. HFSP J 2008;2(2):110–20. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22. Drake AG, Klingenberg CP.. The pace of morphological change: historical transformation of skull shape in St Bernard dogs. Proc R Soc Lond B Biol Sci 2008;275(1630):71–6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23. Feng X, Wilson Y, Bowers J, et al. Evolution of allometry in Antirrhinum. Plant Cell 2009;21(10):2999–3007. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24. Drake AG, Klingenberg CP.. Large-scale diversification of skull shape in domestic dogs: disparity and modularity. Am Naturalist 2010;175(3):289–301. [DOI] [PubMed] [Google Scholar]
- 25. Chitwood DH, Headland LR, Ranjan A, et al. Leaf asymmetry as a developmental constraint imposed by auxin-dependent phyllotactic patterning. Plant Cell 2012;24(6):2318–27. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26. Chitwood DH, Headland LR, Kumar R, et al. The developmental trajectory of leaflet morphology in wild tomato species. Plant Physiol 2012;158(3):1230–40. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27. Chitwood DH, Headland LR, Filiault DL, et al. Native environment modulates leaf size and response to simulated foliar shade across wild tomato species. PLoS One 2012;7(1):e29570.. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28. Fu G, Bo W, Pang X, et al. Mapping shape quantitative trait loci using a radius-centroid-contour model. Heredity 2013;110(6):511–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29. Chitwood DH, Ranjan A, Kumar R, et al. Resolving distinct genetic regulators of tomato leaf shape within a heteroblastic and ontogenetic context. Plant Cell 2014;26(9):3616–29. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30. Chitwood DH, Topp CN.. Revealing plant cryptotypes: defining meaningful phenotypes among infinite traits. Curr Opin Plant Biol 2015;24:54–60. [DOI] [PubMed] [Google Scholar]
- 31. Chitwood DH, Klein LL, O’Hanlon R, et al. Latent developmental and evolutionary shapes embedded within the grapevine leaf. New Phytol 2015;210(1):343–55. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32. Monteiro LR. Multivariate regression models and geometric morphometrics: the search for causal factors in the analysis of shape. Syst Biol 1999;48(1):192–9. [DOI] [PubMed] [Google Scholar]
- 33. Bookstein FL. The relation between geometric morphometrics and functional morphology, as explored by Procrustes interpretation of individual shape measures pertinent to function. Anat Rec 2015;298(1):314–27. [DOI] [PubMed] [Google Scholar]
- 34. Müller HG. Functional modelling and classification of longitudinal data. Scand J Stat 2005;32(2):223–40. [Google Scholar]
- 35. Soille P. Morphological Image Analysis: Principles and Applications. Springer Science & Business Media, Springer-Verlag Berlin Heidelberg, 2013. [Google Scholar]
- 36. Ramsay J, Silverman B.. Functional Data Analysis. New York: Springer,1997. [Google Scholar]
- 37. Ramsay JO, Silverman BW.. Applied Functional Data Analysis: Methods and Case Studies. Citeseer, Springer-Verlag, NY, 2002. [Google Scholar]
- 38. Yao F, Müller HG, Clifford AJ, et al. Shrinkage estimation for functional principal component scores with application to the population kinetics of plasma folate. Biometrics 2003;59(3):676–85. [DOI] [PubMed] [Google Scholar]
- 39. Yao F, Müller HG, Wang JL.. Functional data analysis for sparse longitudinal data. J Am Stat Assoc 2005;100(470):577–90. [Google Scholar]
- 40. Yao F, Lee T.. Penalized spline models for functional principal component analysis. J R Stat Soc Series B Stat Methodol 2006;68(1):3–25. [Google Scholar]
- 41. Müller HG, Stadtmüller U, Yao F.. Functional variance processes. J Am Stat Assoc 2006;101(475):1007–18. [Google Scholar]
- 42. Nettel-Aguirre A. Nuclei shape analysis, a statistical approach. Image Anal Stereol 2008;27(1):1–10. [Google Scholar]
- 43. Epifanio I, Ventura-Campos N.. Functional data analysis in shape analysis. Comput Stat Data Anal 2011;55(9):2758–73. [Google Scholar]
- 44. Tan KL, Ooi BC, Thiang LF.. Indexing shapes in image databases using the centroid–radii model. Data Knowl Eng 2000;32(3):271–89. [Google Scholar]
- 45. Kong X, Luo Q, Zeng G, et al. A new shape descriptor based on centroid–radii model and wavelet transform. Opt Commun 2007;273(2):362–6. [Google Scholar]
- 46. Saunders G, Fu G, Stevens JR.. A Graphical Weighted Power Improving Multiplicity Correction Approach for SNP Selections. Curr Genomics 2014;15(5):380.. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 47. Huang M, Li R, Wang H, et al. Estimating mixture of Gaussian processes by kernel smoothing. J Bus Econ Stat 2014;32(2):259–70. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 48. Wang S, Huang M, Wu X, et al. Mixture of functional linear models and its application to CO2-GDP functional data. Comput Stat Data Anal 2016;97:1–15. [Google Scholar]
- 49. Hall P, Müller HG, Wang JL.. Properties of principal component methods for functional and longitudinal data analysis. Annal Stat 2006;34(3):1493–517. [Google Scholar]
- 50. Indritz J. Methods in Analysis. Macmillan, 1963. [Google Scholar]
- 51. Karhunen K, Zur spektraltheorie stochastischer prozess. Annaleas Academiae Scientiarum Fennicae, 1946;34. [Google Scholar]
- 52. Fan J, Gijbels I.. Variable bandwidth and local linear regression smoothers. Ann Stat 1992;20:2008–36. [Google Scholar]
- 53. Epanechnikov VA. Non-parametric estimation of a multivariate probability density. Theory Probab Appl 1969;14(1):153–8. [Google Scholar]
- 54. Rice JA, Silverman BW.. Estimating the mean and covariance structure nonparametrically when the data are curves. J R Stat Soc Series B Methodol 1991;53:233–43. [Google Scholar]
- 55. Wang Z, Wu R.. A statistical model for high-resolution mapping of quantitative trait loci determining HIV dynamics. Stat Med 2004;23(19):3033–51. [DOI] [PubMed] [Google Scholar]
- 56. Fan J, Zhang C, Zhang J.. Generalized likelihood ratio statistics and Wilks phenomenon. Ann Stat 2001;29:153–93. [Google Scholar]
- 57. Wu R, Ma C, Casella G.. Statistical Genetics of Quantitative Traits: Linkage, Maps and Qtl. Springer Science & Business Media, Springer-Verlag, NY, 2007. [Google Scholar]
- 58. Churchill GA, Doerge RW.. Empirical threshold values for quantitative trait mapping. Genetics 1994;138(3):963–71. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 59. Aschard H, Vilhjálmsson BJ, Greliche N, et al. Maximizing the power of principal-component analysis of correlated phenotypes in genome-wide association studies. Am J Hum Genet 2014;94(5):662–76. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 60. Turnpenny PD, Ellard S.. Emery’s Elements of Medical Genetics. Elsevier Health Sciences, Churchill Livingstone, 2011. [Google Scholar]
- 61. McClellan J, King MC.. Genetic heterogeneity in human disease. Cell 2010;141(2):210–7. [DOI] [PubMed] [Google Scholar]
- 62. Costa C, Antonucci F, Pallottino F, et al. Shape analysis of agricultural products: a review of recent research advances and potential application to computer vision. Food Bioproc Tech 2011;4(5):673–92. [Google Scholar]
- 63. Hollander J, Collyer M, Adams D, et al. Phenotypic plasticity in two marine snails: constraints superseding life history. J Evol Biol 2006;19(6):1861–72. [DOI] [PubMed] [Google Scholar]
- 64. Zelditch ML, Swiderski DL, Sheets HD.. Geometric Morphometrics for Biologists: A Primer. Academic Press, 2012. [Google Scholar]
- 65. Ardekani S, Jain S, Sanzi A, et al. Shape analysis of hypertrophic and hypertensive heart disease using MRI-based 3D surface models of left ventricular geometry. Med Image Anal 2016;29:12–23. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 66. Iwata H, Niikura S, Matsuura S, et al. Evaluation of variation of root shape of Japanese radish (Raphanus sativus L.) based on image analysis using elliptic Fourier descriptors. Euphytica 1998;102(2):143–9. [Google Scholar]
- 67. Royer DL, Wilf P, Janesko DA, et al. Correlations of climate and plant ecology to leaf size and shape: potential proxies for the fossil record. Am J Botany 2005;92(7):1141–51. [DOI] [PubMed] [Google Scholar]
- 68. Ostria-Gallardo E, Ranjan A, Chitwood DH, et al. Transcriptomic analysis suggests a key role for squamosa promoter binding protein like, NAC and YUCCA genes in the heteroblastic development of the temperate rainforest tree Gevuina avellana (Proteaceae). New Phytol 2016;210(2):694–708. [DOI] [PubMed] [Google Scholar]
- 69. Miao Z, Gandelin MH, Yuan B.. A new image shape analysis approach and its application to flower shape analysis. Image Vis Comput 2006;24(10):1115–22. [Google Scholar]
- 70. Lu YC, Untaroiu CD.. Statistical shape analysis of clavicular cortical bone with applications to the development of mean and boundary shape models. Comput Methods Programs Biomed 2013;111(3):613–28. [DOI] [PubMed] [Google Scholar]
- 71. Polychronis G, Christou P, Mavragani M, et al. Geometric morphometric 3D shape analysis and covariation of human mandibular and maxillary first molars. Am J Phys Anthropol 2013;152(2):186–96. [DOI] [PubMed] [Google Scholar]
- 72. Styner M, Oguz I, Xu S, et al. Framework for the statistical shape analysis of brain structures using SPHARM-PDM. Insight J 2006;1071:242–50. [PMC free article] [PubMed] [Google Scholar]
- 73. Martínez-Abadías N, Esparza M, Sjøvold T, et al. Pervasive genetic integration directs the evolution of human skull shape. Evolution 2012;66(4):1010–23. [DOI] [PubMed] [Google Scholar]
- 74. Betts C, Wootton R.. Wing shape and flight behaviour in butterflies (Lepidoptera: Papilionoidea and Hesperioidea): a preliminary analysis. J Exp Biol 1988;138(1):271–88. [Google Scholar]
- 75. Sarver DM. Principles of cosmetic dentistry in orthodontics: part 1. Shape and proportionality of anterior teeth. Am J Orthod Dentofacial Orthop 2004;126(6):749–53. [DOI] [PubMed] [Google Scholar]




