Journal of Anatomy. 2019 May 7;235(2):357–378. doi: 10.1111/joa.12999

Measurement error in μCT‐based three‐dimensional geometric morphometrics introduced by surface generation and landmark data acquisition

Karolin Engelkes 1, Jennice Helfsgott 1, Jörg U Hammel 2,3, Sebastian Büsse 4, Thomas Kleinteich 5, André Beerlink 6, Stanislav N Gorb 4, Alexander Haas 1
PMCID: PMC6637444  PMID: 31062345

Abstract

Computed‐tomography‐derived (CT‐derived) polymesh surfaces are widely used in geometric morphometric studies. This approach is inevitably associated with decisions on scanning parameters, resolution, and segmentation strategies. Although the underlying processing steps have been shown to potentially contribute artefactual variance to three‐dimensional landmark coordinates, their effects on measurement error have rarely been assessed systematically in CT‐based geometric morphometric studies. The present study systematically assessed artefactual variance in landmark data introduced by the use of different voxel sizes, segmentation strategies, surface simplification degrees, and by inter‐ and intra‐observer differences, and compared their magnitude to true biological variation. Multiple CT‐derived surface variants of the anuran (Amphibia: Anura) pectoral girdle were generated by systematic changes in the factors that potentially influence the surface geometries. Twenty‐four landmarks were repeatedly acquired by different observers. The contribution of all factors to the total variance in the landmark data was assessed using random‐factor nested PERMANOVAs. Selected sets of Euclidean distances between landmark sets served further to compare the variance among factor levels. Landmark precision was assessed by landmark standard deviation and compared among observers and days. Results showed that all factors, except for voxel size, significantly contributed to measurement error in at least some of the analyses performed. In total, 6.75% of the variance in landmark data that mimicked a realistic biological study was caused by measurement error. In this landmark dataset, intra‐observer error was the major source of artefactual variance followed by inter‐observer error; the factor segmentation contributed < 1% and slight surface simplification had no significant effect.
Inter‐observer error clearly exceeded intra‐observer error in a different landmark dataset acquired by six partly inexperienced observers. The results suggest that intra‐observer error can potentially be reduced by including a training period prior to the actual landmark acquisition task and by acquiring landmarks in as few sessions as possible. Additionally, the application of moderate and careful surface simplification and, potentially, also the use of case‐specific optimal combinations of automatic local thresholding algorithms and parameters for segmentation can help reduce intra‐observer error. If landmark data are to be acquired by several observers, it is important to ensure that all observers are consistent in landmark identification. Despite the significant amount of artefactual variance, we have shown that landmark data acquired from microCT‐derived surfaces are precise enough to study the shape of anuran pectoral girdles. Yet, a systematic assessment of measurement error is advisable for all geometric morphometric studies.

Keywords: landmark precision, measurement error, micro computed tomography, surface simplification, thresholding

Introduction

Shape analysis and shape comparison by means of landmark‐based geometric morphometrics are well‐established in biology and related fields, as can be seen by the numerous books (e.g. Bookstein, 2003; Zelditch et al. 2012), review articles (e.g. Adams et al. 2004, 2013; Mitteroecker et al. 2013), and practical applications (e.g. Klingenberg et al. 2002; Cox et al. 2011; Pujol et al. 2014). In general, the geometry of a specimen is represented by a set of landmarks, also called landmark configuration. Sets of homologous landmarks of different specimens are then superimposed by scaling, translation, and rotation, which allows for the comparison and analysis of shapes and shape differences among specimens.

Landmarks are represented by either two‐ or three‐dimensional (2D, 3D) Cartesian coordinates depending on the research question and study design. Different ways of landmark acquisition have been documented in the literature: 2D coordinates have been acquired from, for example, digitized or digital photographs of the specimens (e.g. Monteiro, 2000; Verhaegen et al. 2007), whereas 3D coordinates have been measured either directly from the specimens or from digital 3D representations of them (e.g. Collard & O'Higgins, 2001; Heuzé et al. 2016). As with all measurements, landmark coordinates are affected by measurement error (Arnqvist & Mårtensson, 1998), with measurement error being defined as the deviation of the measured value from the true value (Rabinovich, 2006). The presence of such artefactual variance in landmark data has recently been noted to have been overlooked in many geometric morphometric studies (Fruciano, 2016), and different studies have stressed the importance of measurement error assessment (e.g. Klingenberg, 2015; Fruciano, 2016; Robinson & Terhune, 2017).

Several factors have been shown to cause measurement error in landmark data (see also review in Fruciano, 2016). The preservation and preparation of specimens can induce artefactual variance by altering the natural form of the structures of interest (Lee, 1982 [linear measurements]; Bonneau et al. 2012). The variability within repeated measurements performed by the same observer and the variability between different observers can also contribute significantly to measurement error (Ross & Williams, 2008; Robinson & Terhune, 2017; reports of relatively large observer error without tests for statistical significance: Curth et al. 2017; Fruciano et al. 2017; Daboul et al. 2018). Nevertheless, the magnitude of observer error often has been considered small or negligible compared with true biological variability (Richtsmeier et al. 1995; O'Higgins & Jones, 1998; Lockwood et al. 2002; Pujol et al. 2014; Barbeito‐Andrés et al. 2016).

The reduction of an actual 3D specimen to a 2D representation has been shown to cause error in landmark data. Depending on the specimens and research questions, this error has been acceptable and negligible in some cases but in others it has had considerable impact on biological inferences (Cardini, 2014; Buser et al. 2018 and references therein). In 2D landmark data that had been acquired from photographs and with related techniques, variation in the placement of the specimens in front of the camera and optical distortions have been identified to contribute to measurement error (see Arnqvist & Mårtensson, 1998 and references therein for a more detailed discussion of measurement error in 2D‐image‐based workflows). A measurement error introduced by the projection of 3D data to 2D can be avoided by recording 3D landmark coordinates directly from the specimen or a digital 3D representation of it. Studies on the effect of the choice of a method (including the choice of a device) for 3D landmark data acquisition have reported contradictory results on the significance of measurement error caused by method choice and on the effect of choice of method on observer error (e.g. Hale et al. 2014; Fruciano et al. 2017; Robinson & Terhune, 2017; Shearer et al. 2017; Marcy et al. 2018). Hale et al. (2014), for example, found significant differences between landmark data directly digitized from the specimens and landmark data derived from computed tomography (CT) scans of the same specimens. At the level of individual specimens, however, they found no significant difference between corresponding landmark sets. Shearer et al. (2017) reported no significant difference when comparing landmark data acquired from surfaces of a CT scan and of three different surface scanners. 
Robinson & Terhune (2017), in contrast, observed small, yet significant differences between sets of linear distances directly measured from the specimens (caliper measurements and digitization of landmarks) and from digital representations of them (landmarks from surfaces of CT and laser scanner). With regard to observer error, Shearer et al. (2017), for example, found no global dependence of observer error on the respective method chosen; yet, in some parts of their analysis they found significant dependencies. It should be noted, however, that differences between studies could be the result of different study designs and statistical approaches.

Although polymesh surfaces derived from CT scans have been widely used for the acquisition of 3D landmark sets (e.g. Kulemeyer et al. 2009; Bilfeld et al. 2013; Wang et al. 2015; Kesterke et al. 2018), measurement error associated with this particular workflow has, to our knowledge, rarely been considered beyond artefactual intra‐ and inter‐observer variance. Observer‐introduced variance has been reported by, for example, Valeri et al. (1998) and Barbeito‐Andrés et al. (2012). Gunz et al. (2012) found that the spatial resolution of microCT (μCT) scans of the mammalian bony labyrinth and the specific thresholds selected for surface generation from volumetric data affected landmark measurements. Threshold selection as source of measurement error was also noted by Williams & Richtsmeier (2003). Simon & Marroig (2015) found no considerable difference in landmark precision when different voxel sizes were used; however, they identified the use of different filters during μCT scanning as a potential source of measurement error.

Studies outside the field of geometric morphometrics indirectly support the notion that methodological decisions related to CT scanning and data processing could contribute to measurement error in landmark data acquired from CT‐derived surfaces. Measurements such as linear distances or volumes have changed considerably, with varying automatically or manually selected thresholds for surface generation (e.g. Coleman & Colbert, 2007; Parkinson et al. 2008). Voxel size has had a considerable effect on measurements and the effect of the segmentation method seemed to increase with voxel size (Christiansen, 2016). Scanner type and imaging conditions have affected the geometry of surfaces generated from CT scans, although the effects have been small compared with the variance introduced by manual segmentation (Colman et al. 2017). All the factors influencing the segmentation results and the geometry of a CT‐derived surface potentially contribute to measurement error in landmark data obtained from such surfaces.

The overall magnitude of artefactual variance in measurements acquired from CT data might be small. For example, linear measurements obtained from CT scans of human skulls were shown to differ insignificantly from corresponding measurements on the original skulls (Lorkiewicz‐Muszyńska et al. 2015; but see Hildebolt et al. 1990 and Richtsmeier et al. 1995 for contradictory reports). Furthermore, the variance of linear distances derived from landmarks acquired from repeated scans of the same specimen has been low (Richtsmeier et al. 1995). Finally, various measures of trabecular bone have not differed significantly from measures obtained from histological sections when using an appropriate threshold (Fajardo et al. 2002).

The frequent use of CT‐derived surfaces in geometric morphometric studies, the scarce and unsystematic assessment of measurement error related to this particular workflow, and the partly contradictory reports on this topic in literature call for further analyses. The present study aims to investigate systematically the contribution of different factors in surface generation and landmark acquisition to the total variance in landmark data. Measurement error due to voxel size, segmentation strategy (i.e. the use of different thresholds), surface simplification, and inter‐ and intra‐observer differences was assessed and compared with true biological variance due to inter‐specimen and inter‐specific differences, as well as the variation between body sides of a given specimen. Specimens from two species of the genus Bombina (Amphibia: Anura: Bombinatoridae) were chosen for obtaining landmark data from the bones of their pectoral girdles. The pectoral girdle of these species consists of two roughly C‐shaped halves (in anterior view) that are ventrally overlapping. Each half comprises four bones (Figs 1A and Supporting Information Fig. S1) connected by cartilaginous elements (Maglia & Púgener, 1998). Recommendations for the reduction of measurement error in landmark data of the anuran pectoral girdle have been derived; these recommendations might be applicable for CT‐derived landmark data of other biological structures as well.

Figure 1.

Selected surface variants of the pectoral girdle bones of Bombina orientalis (ZMH A12601, ventral girdle half only), lateral view. All surfaces are based on the same μCT scan. Scale: 2 mm. Colour coding denotes distance of the respective surface to the surface in (A). (A) Surface variant generated using the MidGrey thresholding algorithm with the intersecting‐2‐of‐3 strategy and no simplification. (B) Surface variant generated using the Otsu thresholding algorithm with the intersecting‐3 strategy and no simplification. (C) Surface variant generated using a dynamically adapted subjectively optimal threshold and no simplification. (D‐F) Subjectively optimal simplified variants of surfaces (A‐C), respectively. See text for more details on surface generation.

Materials and methods

MicroCT scans of several specimens were used to generate different polymesh surface variants of each scan. Those surface variants differed, for example in the underlying segmentation and the degree of surface processing. The surface variants were repeatedly landmarked by different observers to systematically acquire four different sets of landmark configurations (‘Landmark Datasets’). Those Landmark Datasets served to analyse different aspects of measurement error.

Specimens and μCT scanning

MicroCT scans of nine specimens of Bombina orientalis (Boulenger, 1890) (Amphibia: Anura: Bombinatoridae) and nine of Bombina bombina (Linnaeus, 1761) were performed using either a Skyscan 1172 (Bruker microCT), Phoenix Nanotom S (General Electric) or a YXLON FF20 CT or FF35 CT (YXLON International GmbH; Table 1). The specimens were mounted with wadding in plastic containers and CT‐scanned in an ethanol‐saturated atmosphere. Volumetric datasets were reconstructed from X‐ray projections using the reconstruction software delivered with the respective scanner.

Table 1.

Specimens, pectoral girdle sizes, and parameters of μCT scanning.

Species (Collection number) | Girdle size (mm) | Scanner | Current (μA) | Voltage (kV) | Filter | Voxel size (μm) | CNR (bone–soft tissue)
Bombina bombina (ZMH A05110) | 11.95 / 8.09 | YXLON FF35 CT | 120 | 100 | – | 22.75 | 21.34
Bombina bombina (ZMH A05383) | 9.56 / 6.74 | YXLON FF35 CT | 120 | 100 | – | 22.75 | 19.93
Bombina bombina (ZMH A05617) | 12.02 / 7.33 | YXLON FF20 CT | 80 | 80 | – | 25.84 | 48.70
Bombina bombina (ZMH A05619) | 10.98 / 7.70 | YXLON FF35 CT | 120 | 100 | – | 22.75 | 36.87
Bombina bombina (ZMH A06659) | 10.40 / 8.01 | SkyScan1172 | 100 | 100 | Al 0.5 mm | 26.68 | 16.76
Bombina bombina (ZMH A06683) | 10.42 / 7.55 | SkyScan1172 | 100 | 100 | Al 0.5 mm | 26.68 | 14.81
Bombina bombina (ZMH A06685) | 10.57 / 7.57 | SkyScan1172 | 100 | 100 | Al 0.5 mm | 26.68 | 18.85
Bombina bombina (ZMH A06690) | 10.73 / 8.05 | SkyScan1172 | 100 | 100 | Al 0.5 mm | 26.68 | 20.29
Bombina bombina (ZMH A09674) | 7.04 / 5.71 | SkyScan1172 | 200 | 49 | Al 0.5 mm | 21.34 | 24.98
Bombina orientalis (ZMH A05672) | 10.72 / 7.46 | YXLON FF35 CT | 120 | 100 | – | 30.3 | 48.88
Bombina orientalis (ZMH A05676) | 10.83 / 7.28 | SkyScan1172 | 100 | 100 | Al 0.5 mm | 21.34 | 11.21
Bombina orientalis (ZMH A05677) | 12.00 / 8.07 | YXLON FF35 CT | 120 | 100 | – | 22.75 | 33.43
Bombina orientalis (ZMH A05681) | 12.97 / 7.23 | YXLON FF35 CT | 120 | 100 | – | 22.75 | 33.86
Bombina orientalis (ZMH A05682) | 13.60 / 6.82 | YXLON FF35 CT | 120 | 100 | – | 22.75 | 35.31
Bombina orientalis (ZMH A12601) | 9.53 / 8.06 | Nanotom S | 170 | 60 | – | 23.37 | 36.17
Bombina orientalis (ZMH A14347) | 12.37 / 7.18 | SkyScan1172 | 100 | 100 | Al 0.5 mm | 26.68 | 20.85
Bombina orientalis (ZMH A14350) | 11.67 / 6.81 | SkyScan1172 | 100 | 100 | Al 0.5 mm | 26.68 | 20.88
Bombina orientalis (ZMH A14354) | 11.97 / 7.65 | SkyScan1172 | 100 | 100 | Al 0.5 mm | 26.68 | 24.09

CNR, contrast‐to‐noise ratio calculated by dividing the difference of the mean grey values of pectoral girdle bones and surrounding soft tissues by the standard deviation of soft tissues; Girdle size, first value gives distance between anterodorsal tips of scapulae, second value gives mean distance between anteromedial tip of clavicula and dorsal end of anterior margin of cleithrum.

The contrast‐to‐noise ratio (CNR) was calculated for each μCT scan by dividing the difference of the mean grey values of the pectoral girdle bones and the surrounding soft tissues by the standard deviation of the soft tissue grey values. The volume considered for mean bone grey value calculation was determined in Amira® (version 6.0.1; Konrad‐Zuse‐Zentrum Berlin, FEI Visualization Sciences Group) by manually segmenting the bones (Magic Wand tool) and shrinking the selection by two voxels; the volume for calculating the mean value of the soft tissues was arbitrarily chosen. Mean grey values and standard deviations were calculated using the Material Statistics module.
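The CNR definition above can be illustrated with a minimal Python sketch. The grey values are invented toy numbers, the function name is hypothetical, and the use of the population standard deviation is an assumption, since the text does not state which estimator the Material Statistics module reports.

```python
from statistics import mean, pstdev


def contrast_to_noise_ratio(bone_values, soft_tissue_values):
    """CNR = (mean bone grey value - mean soft tissue grey value) / SD of soft tissue."""
    # Assumption: population standard deviation; the source does not specify the estimator.
    noise = pstdev(soft_tissue_values)
    if noise == 0:
        raise ValueError("soft tissue grey values have zero standard deviation")
    return (mean(bone_values) - mean(soft_tissue_values)) / noise


# Toy grey values (hypothetical, not taken from any scan in Table 1).
bone = [200, 210, 190, 205, 195]
soft_tissue = [100, 110, 90, 105, 95]
cnr = contrast_to_noise_ratio(bone, soft_tissue)
print(round(cnr, 2))
```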

The size of the pectoral girdle was recorded for each specimen by determining the distance between the anterodorsal tips of the two scapulae, as well as by averaging the left and right distances between the anteromedial tip of the clavicula and the dorsal end of the respective anterior margin of the cleithrum. Measurements were performed in Amira® on an Isosurface of the result of the MidGreyT segmentation strategy (see below).

Comparison of automatic local thresholding strategies

The anatomical structures of interest in a CT scan need to be segmented before polymesh surfaces of them can be generated. Segmentation can be done by using grey value thresholds that allow for the discrimination of different tissue types due to differences in their X‐ray absorption. Based on our personal experiences in CT data segmentation, we expect automatic local thresholding algorithms to be superior to automatic global thresholding and to subjective threshold determination by eye. However, irrespective of the performance of a given automatic thresholding algorithm, adjacent structures of similar X‐ray densities need to be separated manually.

Several automatic local thresholding algorithms are available in the Auto Local Threshold plugin (Landini, https://imagej.net/Auto_Local_Threshold, accessed 23 February 2018) for the image processing tool Fiji (based on ImageJ version 1.51n; Schindelin et al. 2012; Schneider et al. 2012). Those algorithms were compared with regard to the quality of the thresholding results in order to determine the best segmentation strategy for the μCT scan of a haphazardly selected B. orientalis specimen (ZMH A12601). To do so, a synthetic image stack with defined bone‐, soft tissue‐ and background‐areas (‘phantom images/stack’) was created and virtually CT‐scanned (including the simulation of image noise). The thresholding algorithms were applied to the synthetic CT image stack and the thresholding results were then compared with the phantom stack to assess the thresholding quality. Selected algorithm–parameter combinations were applied to resliced versions of the synthetic CT image stack; all thresholding results derived by a given algorithm–parameter combination were combined and the thresholding quality was assessed as above (compare Supporting Information Fig. S2; for a more detailed description of the workflow see Supporting Information Text S1). Assuming that real CT scans behaved similarly to the synthetic CT scan during automatic local thresholding, this approach allowed for the determination of a segmentation strategy that would generate surface geometries close to the real form of the specimens. For the scan of ZMH A12601, the optimal algorithm–parameter combination for the synthetic scan may be expected to result in a surface close to the true shape of the specimen, because the phantom stack has been designed to simulate that particular scan.

The MidGrey algorithm with a radius of 9, a parameter of –4, and the intersecting‐2‐of‐3 strategy (see Supporting Information Text S1 for explanation) performed best with a misclassification rate of 2.5% in the evaluated volume (Supporting Information Table S1).
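The comparison of a thresholding result with the phantom can be sketched in Python as follows. This is a simplified illustration, not the Fiji implementation: it assumes that MidGrey labels a pixel as object when its grey value exceeds the local mid‐grey, (max + min)/2, minus the parameter, evaluated in a circular window that is clipped at the image border. The function names, the two‐dimensional toy image, and the radius of 1 are hypothetical; the intersecting strategies of Supporting Information Text S1 are not reproduced.

```python
def midgrey_threshold(image, radius, parameter):
    """Label a pixel 1 (object) if it exceeds the local mid-grey minus the parameter."""
    rows, cols = len(image), len(image[0])
    result = [[0] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            # Grey values inside a circular window, clipped at the image border
            # (border handling is an assumption of this sketch).
            window = [
                image[yy][xx]
                for yy in range(max(0, y - radius), min(rows, y + radius + 1))
                for xx in range(max(0, x - radius), min(cols, x + radius + 1))
                if (yy - y) ** 2 + (xx - x) ** 2 <= radius ** 2
            ]
            local_mid = (max(window) + min(window)) / 2
            result[y][x] = 1 if image[y][x] > local_mid - parameter else 0
    return result


def misclassification_rate(result, truth):
    """Fraction of pixels whose label differs from the phantom (ground truth)."""
    total = sum(len(row) for row in truth)
    wrong = sum(r != t for res_row, tru_row in zip(result, truth)
                for r, t in zip(res_row, tru_row))
    return wrong / total


# Toy phantom: a two-pixel-wide "bone" strip (1) in background (0).
phantom = [[0, 0, 1, 1, 0, 0] for _ in range(4)]
# Toy "scan" of the phantom: bone = 200, background = 50 (no noise).
scan = [[200 if label else 50 for label in row] for row in phantom]

segmented = midgrey_threshold(scan, radius=1, parameter=-4)
rate = misclassification_rate(segmented, phantom)
print(rate)
```

In this sketch a negative parameter raises the local threshold, so a window that contains only uniform grey values is classified as background; the window therefore has to span a grey‐value transition for bone to be detected.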

Segmentation

All 18 μCT scans were used to compare intra‐ and inter‐specific variation, as well as the variance between both body halves of a given specimen, with the magnitude of measurement error introduced by various factors during surface generation and landmark acquisition (Table 2). To cover a range of surface variants that might reasonably be used in geometric morphometric studies (‘reasonable’ surfaces), the pectoral girdle bones were segmented (Amira®) three times using different thresholding strategies: the subjectively optimal threshold (Magic Wand tool with manual separation of anatomical structures where needed; ‘SubThresh’) and the result of two different automatic local thresholding algorithms (manual separation of anatomical structures where needed; ‘OtsuT’, ‘MidGreyT’). The subjectively optimal threshold was determined by eye for each pectoral girdle bone or part of a bone separately. A threshold was considered optimal if the bone–soft‐tissue boundary was relatively smooth and lay centrally within the grey value gradient between bone and soft tissue voxels. The local version of the algorithm by Otsu (1979) implemented in the Auto Local Threshold function was chosen as one of the automatic local thresholding algorithms because local versions of this algorithm yielded good results in previous studies (e.g. Landini et al. 2017; Healy et al. 2018). The three‐dimensionality of the data was accounted for by using the intersecting‐3 strategy described in Supporting Information Text S1. The MidGrey algorithm implemented in the Auto Local Threshold function was chosen as a second automatic local thresholding algorithm. It was performed for all specimens with the same parameters and with the strategy that resulted in the best thresholding quality for the reconstructed phantom stack (Radius: 9, Parameter 1: –4, intersecting‐2‐of‐3 strategy).
This segmentation strategy was chosen because it was expected to result in the most natural surfaces for the scan of the specimen selected for phantom stack generation (ZMH A12601) and to produce reasonable surfaces for the other specimens, too.

Table 2.

Factors and levels that were considered for evaluating measurement error in relation to inter‐ and intra‐specific variation and the variance between body halves of a given specimen

Factor | Levels | Abbreviation | Remarks
Species (random) | Bombina orientalis | B‐ori | –
Species (random) | Bombina bombina | B‐bom | –
Specimen (random, nested in species) | 9 B. orientalis and 9 B. bombina specimens | Species abbreviation and collection number | Abbreviated using the abbreviation of the respective species combined with the collection number of the specimen, e.g. B‐ori‐A12601
Position (random, nested in specimen) | Ventral | v | The pectoral girdle half whose epicoracoid cartilage lies ventral (superficial) to the other half
Position (random, nested in specimen) | Dorsal | d | –
Segmentation (random, nested in position) | Subjectively optimal threshold | SubThresh | –
Segmentation (random, nested in position) | Automatic local thresholding using the Otsu algorithm with intersecting‐3 strategy | OtsuT | Otsu algorithm with Radius: 15
Segmentation (random, nested in position) | Automatic local thresholding using the MidGrey algorithm with intersecting‐2‐of‐3 strategy | MidGreyT | MidGrey algorithm with Radius: 9 and Parameter 1: –4
Simplification (random, nested in segmentation) | Original, unsimplified surface | Original | –
Simplification (random, nested in segmentation) | Subjectively optimal reduction and smoothing | subSimpl | –
Observer (random, nested in simplification) | KE | O1 | 3 repetitions on three different days (one repetition per day)
Observer (random, nested in simplification) | JH | O2 | 3 repetitions on three different days (one repetition per day)

Factors were considered random and nested in PERMANOVA with the residual term reflecting the repetitions on the same surface variant. Landmark sets acquired from surfaces that were generated according to these factors comprise Landmark Dataset 1.

The pectoral girdle halves overlap medially in Bombina; the girdle half with the ventral (superficial) epicoracoid cartilage of the scan of a selected B. orientalis specimen (ZMH A12601) served for testing the effects of more extreme surface variants that would probably not be used in geometric morphometric studies. To simulate different scan resolutions, the resolution of the μCT scan was decreased (i.e. the voxel size was increased by binning) by merging 2 × 2 × 2 (‘Down2’) and 4 × 4 × 4 (‘Down4’) voxels, respectively (Resample module in Amira®, filter: Lanczos). The bones in the original (‘NoDown’) and downsampled (Down2, Down4) stacks were each segmented five times: using the three segmentation strategies described above (SubThresh, MidGreyT, OtsuT) and, additionally, the lowest (‘MinThresh’) and the highest (‘MaxThresh’) thresholds that resulted in a useable surface (Table 3).
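The effect of merging voxels on voxel size can be sketched in Python. Note the substitution: the study resampled with a Lanczos filter in Amira®, whereas this sketch simply averages non‐overlapping blocks, which is an assumption chosen for brevity. The function names and the toy volume are hypothetical; only the 23.37 μm voxel size of the ZMH A12601 scan is taken from the text.

```python
def bin_volume(volume, factor):
    """Average non-overlapping factor x factor x factor blocks of a nested-list volume.

    Simple block mean; NOT the Lanczos resampling used in the study.
    """
    nz, ny, nx = len(volume), len(volume[0]), len(volume[0][0])
    binned = []
    for z in range(0, nz - nz % factor, factor):
        plane = []
        for y in range(0, ny - ny % factor, factor):
            row = []
            for x in range(0, nx - nx % factor, factor):
                block = [
                    volume[z + dz][y + dy][x + dx]
                    for dz in range(factor)
                    for dy in range(factor)
                    for dx in range(factor)
                ]
                row.append(sum(block) / len(block))
            plane.append(row)
        binned.append(plane)
    return binned


def binned_voxel_size(voxel_size_um, factor):
    """Edge length of a voxel after merging factor voxels along each axis."""
    return voxel_size_um * factor


# Toy 4 x 4 x 4 volume whose grey value equals the slice index z.
toy = [[[z for _ in range(4)] for _ in range(4)] for z in range(4)]
down2 = bin_volume(toy, 2)

# Voxel sizes for the ZMH A12601 scan (23.37 um in the original stack).
sizes = {name: round(binned_voxel_size(23.37, f), 2)
         for name, f in [("NoDown", 1), ("Down2", 2), ("Down4", 4)]}
print(sizes)
```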

Table 3.

Factors and levels that were considered for evaluating measurement error that was introduced by segmentation and surface generation for Bombina orientalis (ZMH A12601; ventral/superficial girdle half only)

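Factor | Levels | Abbreviation | Remarks
Downsampling (random) | No downsampling | NoDown | Voxel size: 23.37 μm
Downsampling (random) | Downsampling of 2 × 2 × 2 voxels | Down2 | Voxel size: 46.74 μm
Downsampling (random) | Downsampling of 4 × 4 × 4 voxels | Down4 | Voxel size: 93.48 μm
Segmentation (random, nested in resolution) | Subjectively optimal threshold | SubThresh | –
Segmentation (random, nested in resolution) | Automatic local thresholding using the Otsu algorithm with intersecting‐3 strategy | OtsuT | Otsu algorithm with Radius: 15
Segmentation (random, nested in resolution) | Automatic local thresholding using the MidGrey algorithm with intersecting‐2‐of‐3 strategy | MidGreyT | MidGrey algorithm with Radius: 9 and Parameter 1: –4
Segmentation (random, nested in resolution) | Lowest threshold resulting in a useable surface | MinThresh | –
Segmentation (random, nested in resolution) | Highest threshold resulting in a useable surface | MaxThresh | –
Simplification (random, nested in segmentation) | Original, unsimplified surface | Original | –
Simplification (random, nested in segmentation) | Subjectively optimal reduction and smoothing | subSimpl | –
Simplification (random, nested in segmentation) | Strong reduction and smoothing | strongSimpl | –
Day (random, nested within simplification) | Day 1 | Day 1 | 3 repetitions
Day (random, nested within simplification) | Day 2 | Day 2 | 3 repetitions
Day (random, nested within simplification) | Day 3 | Day 3 | 3 repetitions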

Factors were considered random and nested in PERMANOVA with the residual term reflecting the repetitions on the same day. Landmark sets acquired from surfaces that were generated according to these factors comprise Landmark Dataset 2.

Surface generation and processing

Different polymesh surface variants were generated from each of the segmentation results for the ventral and dorsal pectoral girdle halves separately and were exported (obj format) in their original condition (‘original’). In a next step, copies of these surfaces were simplified (polygon count reduction and smoothing) to a subjectively optimal degree that smoothed surface irregularities without losing anatomical details (‘subSimpl’). A strongly simplified surface (highest degree of simplification that resulted in a useable surface; ‘strongSimpl’) was exported for each segmentation belonging to the ventral girdle half of ZMH A12601 to cover the maximum range of possible surface variants for this specimen. Surface generation, simplification, and export were accelerated using a custom Amira® macro (MultiExport, see Engelkes et al. 2018 for details).

The surfaces were converted to ply format in MeshLab (version 1.3.3; Cignoni et al. 2008). Furthermore, the surfaces of the right pectoral girdle halves were mirrored to match the orientation of the left. This allowed for assessing the shape difference of both girdle halves for comparing the magnitude of intra‐specimen variation to that of measurement error. The surfaces were mirrored prior to landmark acquisition instead of, as commonly done (see Klingenberg et al. 2002; Zelditch et al. 2012), mirroring landmark sets. Mirroring surfaces avoided potential bias in landmark acquisition that could have resulted from differing surface orientations.
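Mirroring a triangle mesh can be sketched in Python as follows. The text does not state which plane or tool was used for mirroring, so reflection across the plane x = 0 is an assumption, and the function names are hypothetical. Reflection inverts the winding of each triangle, so the vertex order of every face is reversed to keep the surface normals pointing outward.

```python
def mirror_mesh(vertices, faces):
    """Reflect a triangle mesh across the plane x = 0 (assumed mirror plane)."""
    mirrored_vertices = [(-x, y, z) for x, y, z in vertices]
    # Reversing the vertex order restores the original face orientation.
    mirrored_faces = [tuple(reversed(face)) for face in faces]
    return mirrored_vertices, mirrored_faces


def face_normal(vertices, face):
    """Unnormalised normal of a triangle via the cross product of two edges."""
    a, b, c = (vertices[i] for i in face)
    u = tuple(b[i] - a[i] for i in range(3))
    v = tuple(c[i] - a[i] for i in range(3))
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


# Toy mesh: a single triangle in the xy-plane with its normal along +z.
vertices = [(0, 0, 0), (1, 0, 0), (0, 1, 0)]
faces = [(0, 1, 2)]

mirrored_vertices, mirrored_faces = mirror_mesh(vertices, faces)
print(face_normal(vertices, faces[0]))
print(face_normal(mirrored_vertices, mirrored_faces[0]))
```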

Landmarks and landmark acquisition

In total, 24 landmarks were defined and acquired in the software Landmark (version 3.0.0.6; Wiley et al. 2005; Fig. 2, Supporting Information Table S2). The landmarks represented the shape of the shoulder girdle bones and their positions relative to each other within the same girdle half. Some of the landmarks (e.g. 5, 11, 20, 21) might as well have been registered as parts of series of semi‐landmarks (Bookstein, 1997; Gunz et al. 2005). Yet, all defined landmarks were landmarks sensu Bookstein (2003), and we used them as such to avoid the potential shape variance associated with semi‐landmark processing/sliding (compare Perez et al. 2006).

Figure 2.

Landmarks on the girdle half of Bombina orientalis (ZMH A12601) that comprises the ventral (superficial) epicoracoid cartilage (MidGreyT, subSimpl surface variant). Scale: 2 mm. (A) Lateral view. (B) Medial view.

We acquired four different landmark datasets and subjected them to specific analyses to assess different aspects of potential measurement error. Each ‘reasonable’ surface created according to the factor levels in Table 2 was landmarked by two different observers (‘O1’, ‘O2’), both experienced in landmark acquisition and anuran pectoral girdle anatomy. O1 and O2 initially discussed and agreed on landmark definitions, but then acquired landmark data independently. Each observer landmarked each surface three times, with each repetition on a given surface being performed on a different day. The total of the landmark sets acquired by O1 and O2 will be called ‘Landmark Dataset 1’.

All 45 surfaces of the ventral girdle half of B. orientalis specimen ZMH A12601 (Table 3) were landmarked nine times by O1; for each surface variant, three landmark sets were acquired on each of three different days (‘Landmark Dataset 2’).
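The design of Landmark Dataset 2 can be enumerated in a short Python sketch that crosses the factor levels of Table 3. The variable names are hypothetical; the resulting counts follow only from the levels and repetitions stated in the text.

```python
from itertools import product

# Factor levels as listed in Table 3.
downsampling = ["NoDown", "Down2", "Down4"]
segmentation = ["SubThresh", "OtsuT", "MidGreyT", "MinThresh", "MaxThresh"]
simplification = ["original", "subSimpl", "strongSimpl"]
days = ["Day 1", "Day 2", "Day 3"]
repetitions_per_day = 3

# Every combination of downsampling, segmentation, and simplification
# corresponds to one surface variant.
surface_variants = list(product(downsampling, segmentation, simplification))

# Each surface variant is landmarked three times on each of three days.
landmark_sets = [
    (variant, day, repetition)
    for variant in surface_variants
    for day in days
    for repetition in range(1, repetitions_per_day + 1)
]

print(len(surface_variants))  # 3 x 5 x 3 surface variants
print(len(landmark_sets))     # nine landmark sets per surface variant
```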

One of the surfaces of specimen ZMH A12601 (NoDown, MidGreyT, original, ventral girdle half) was landmarked by six different observers (O1, O2, and inexperienced ‘O3’–‘O6’). Each observer landmarked the surface in two sessions on consecutive days with 20 repetitions per session. Inexperienced observers were trained and corrected by O1 during the first five repetitions of the first session. The first five repetitions of each session were discarded; the remaining landmark sets were used for analysis (‘Landmark Dataset 3’).

The ‘reasonable’ surface variants (MidGreyT, OtsuT, SubThresh; original, subSimpl) of the ventral girdle halves of two B. orientalis specimens were selected based on the CNRs (bone–soft tissue) of the respective underlying μCT scans. The first specimen (ZMH A05682) was selected because its μCT scan had the CNR (bone–soft tissue) closest to that of ZMH A12601; the automatic local thresholding algorithms with respective parameters optimal for the scan of ZMH A12601 might thus be expected to be close to optimal for the scan of ZMH A05682, too. The second specimen (ZMH A05676) was chosen because its scan had the CNR (bone–soft tissue) that differed most from that of ZMH A12601; thus, the automatic local thresholding algorithms with respective parameters applied might be expected to be suboptimal for the ZMH A05676 scan. The surfaces were landmarked nine times by O1; for each surface variant, three landmark sets were acquired in each of three different sessions with at least 10 h between sessions (‘Landmark Dataset 4’). Supporting Information Table S3 gives an overview of the composition of Landmark Datasets 1–4.

Superimposition

All surfaces derived from the same μCT scan had the same position in space. Therefore, landmark sets on the surfaces of the same girdle half should, in theory, be in perfect superimposition and all remaining variance in landmark position would have to be the result of measurement error. A superimposition of these landmark sets might mask or alter potential measurement error under this condition (compare Corner et al. 1992; von Cramon‐Taubadel et al. 2007; Ross & Williams, 2008). Therefore, we computed the mean landmark configuration of each girdle half in Landmark Dataset 1 and only these mean configurations were superimposed using a full Generalized Procrustes Analysis (full GPA; procGPA function of shapes package version 1.2.3 for R version 3.4.3 in RStudio version 1.1.383; Dryden, 2017; R Core Team, 2017; RStudio Team, 2017). The transformations applied to each mean landmark configuration were determined by superimposing each untransformed mean configuration onto the corresponding transformed configuration. Centroid coordinates and centroid size were computed (Zelditch et al. 2012) and used to determine translation and scaling parameters. The rotation matrix was computed as the matrix $UV^{T}$, where $V$ and $U$ were the left and right matrices of the singular value decomposition $V \Gamma U^{T}$ of the product of the transposed transformed and untransformed mean landmark matrices (Dryden & Mardia, 2016, lemma 4.2). The transformations of the mean configurations were applied to all corresponding landmark sets. This allowed for a full Procrustes superimposition among the means of the separate girdle halves of the specimens while preserving the variance (measurement error) within a given girdle half for further analysis.

The transformed landmark sets of Landmark Dataset 1 were uniformly rescaled such that the centroid size of the mean landmark configuration of the ventral girdle half of ZMH A12601 after superimposition equalled the corresponding centroid size before superimposition (‘superimposed Landmark Dataset 1’). This allowed for a maximum comparability to Landmark Datasets 2 and 3. All computations were performed using basic R functions and functions of the packages geomorph (version 3.0.5; Adams et al. 2017), abind (version 1.4‐5; Plate & Heiberger, 2016), and shapes. Landmark Datasets 2, 3, and 4 were analysed without any superimposition as the analyses were performed for each scan/specimen separately.

Visualization

Landmark locations were visualized for the ventral girdle half of B. orientalis specimen ZMH A12601 (NoDown, MidGreyT, subSimpl) in MODO® (version 10.1v2; The Foundry) by creating spheres with midpoint coordinates adopted from an arbitrary landmark set acquired from that surface. Deviations among the ‘reasonable’ surface variants of selected specimens (ventral girdle halves of ZMH A05110, A05619, A05681, A09674, A12601) were visualized in GOM Inspect 2017 (GOM GmbH). Local deviations were calculated and colour‐coded per vertex (Fig. 1); simplified surfaces had a low vertex count and their vertex count was increased prior to visualization (MODO®, Subdivide: Faceted function) for higher resolution distance mapping in GOM.

Principal component analyses were performed for Landmark Datasets 1–3 separately in R. The results were visualized using the packages ggplot2 (Wickham, 2016) and ggpubr (version 0.1.6; Kassambara, 2017). All figures were arranged in Adobe® Illustrator® CS6 (version 16.0.3; Adobe® Systems Software).
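For orientation, a principal component analysis of landmark sets can be sketched as below. The text does not state which R function was used, so this is only an assumed, generic variant (each configuration flattened to one coordinate vector, columns centred, no rescaling, SVD of the centred matrix); the function name `pca_percent_variance` and the random coordinates are hypothetical.

```python
import numpy as np

def pca_percent_variance(configs):
    """configs: array of shape (n_sets, n_landmarks, 3).
    Returns PC scores and the percentage of total variance per component."""
    flat = configs.reshape(len(configs), -1)      # one row per landmark set
    centred = flat - flat.mean(axis=0)
    U, sv, Vt = np.linalg.svd(centred, full_matrices=False)
    scores = U * sv                               # coordinates along the PCs
    percent = 100 * sv**2 / (sv**2).sum()
    return scores, percent

# Placeholder data: 12 landmark sets with 24 three-dimensional landmarks each.
rng = np.random.default_rng(1)
configs = rng.normal(size=(12, 24, 3))
scores, percent = pca_percent_variance(configs)
```

The first two columns of `scores` correspond to what a plot of principal components 1 and 2 would show, and `percent` gives the share of total variance quoted for each component.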

Permutational analyses of variance

Permutational manovas (permanovas; Anderson, 2001) were performed for each of (superimposed) Landmark Datasets 1–3, treating all factors as random and nested (Arnqvist & Mårtensson, 1998). In particular, for the superimposed Landmark Dataset 1, the residual term reflected the variance of the repetitions by each observer and was nested within observer. Observer was nested within simplification, which was nested within segmentation, which was nested within position, which was nested within specimen, which was nested within species (compare Table 2). For Landmark Dataset 2, the residual term reflected the variance of the repetitions of the same day and was nested within the factor day. Day was nested within simplification, which was nested within segmentation, which was nested within downsampling (compare Table 3). For Landmark Dataset 3, the residual term reflected the variance of the repetitions of the same day and thus was nested within day; day was nested within observer. P‐values < 0.05 were considered significant for all tests.

permanovas were computed in R using the adonis function of the vegan package (version 2.4‐5; Oksanen et al. 2017) with Euclidean distance as distance measure and 9999 permutations. Permutations were performed for each factor separately using the mean configurations of the nested groups defined by the respective next‐lower factor (ensuring correct computation of F‐values) and restricted within the groups defined by the next‐higher factor (compare Anderson & ter Braak, 2003). Permutations of the lowest factor were performed on the landmark sets; permutations of the highest factor were unrestricted. To test, for example, the significance of the factor specimen in the superimposed Landmark Dataset 1, the average landmark configurations of each girdle half were computed for each specimen (in other words, the mean configurations of the groups defined by the next‐lower factor position nested within specimen). Permutations were restricted within species and the reduced model comprised the factor ‘specimen’ nested within species; if there had been higher factors, those would have been part of the reduced model as well. The corresponding F‐ and P‐values of the full model (correct values for degrees of freedom, sum of squares, and mean squares) were replaced by those of the reduced models. The relative contribution of each factor to the total variance (variance components expressed as percentages) in a respective landmark dataset was computed following Sokal & Rohlf (1981); negative variance components were set to zero and not considered for percentage calculation (only applicable for the factor simplification in Landmark Dataset 1).
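As a worked illustration of the variance-component step, the sketch below applies the textbook estimator for a balanced, fully nested random-effects design to the mean squares reported in Table 6 (observer, day, residual). It is an assumption that this matches the computation actually performed; the function name `nested_variance_components` is hypothetical, and the group sizes (two days per observer, 15 retained repetitions per day) are taken from the description of Landmark Dataset 3.

```python
from math import prod

def nested_variance_components(mean_squares, counts):
    """Percent variance components for a balanced, fully nested design.

    mean_squares: highest factor first, residual last.
    counts[i]: number of units of the next-lower level within one unit of
               level i (last entry = repetitions per lowest group).
    Negative estimates are set to zero, as described in the text.
    """
    components = []
    for i, ms in enumerate(mean_squares[:-1]):
        coeff = prod(counts[i:])                  # replicates below this level
        components.append(max((ms - mean_squares[i + 1]) / coeff, 0.0))
    components.append(mean_squares[-1])           # residual variance
    total = sum(components)
    return [100 * c / total for c in components]

# Mean squares from Table 6: observer, day (nested in observer), residual.
table6_ms = [1686469.66, 81416.87, 18327.07]
percent = nested_variance_components(table6_ms, counts=[2, 15])
print([round(p, 2) for p in percent])  # [70.36, 5.53, 24.1]
```

With these inputs the estimator returns the percentages listed in Table 6, which suggests, but does not prove, that the published values were obtained in this way.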

Using different scanner types and software packages for volume data acquisition might have induced artefactual variance in the superimposed Landmark Dataset 1; we did not account for this, as a previous study found only minor effects of scanner type on the geometry of derived polygon surfaces (Colman et al. 2017). If there was artefactual variance caused by CT scanning and volume reconstruction, this would be incorporated in the factor ‘specimen’ in the permanova. Consequently, the measured variance due to specimen would artefactually be higher than it actually was and measurement error would be slightly underestimated.

Variance within and between subgroups

Selected sets of pairwise Euclidean distances between landmark configurations were computed to compare informally the magnitudes of the variation within different subgroups. The distance sets were individually pooled according to selected factors (i.e. distances within sub‐groups were treated as one distance set if they were associated with the same level of a given factor) to assess informally the dependence of variance within subgroups on the different levels of the factors. A small variance, and thus an overall high precision in landmark placing, became apparent in short pairwise distances within the (pooled) subgroups. Notched boxplots (McGill et al. 1978) of the (pooled) groups of distances were used to identify substantial differences among groups. All calculations were performed in R.

In particular, for superimposed Landmark Dataset 1, the pairwise Euclidean distances of the 18 landmark sets acquired from the same girdle half by the same observer were calculated for each observer separately, resulting in 153 distances per observer per girdle half. The distances were pooled by specimens; in particular, all distances of a given specimen, calculated within observer and girdle halves separately, were treated as one set of distances. This allowed for an informal assessment of the precision with which the surfaces of a given scan could be landmarked. For Landmark Dataset 2, pairwise distances were computed among the nine landmark sets acquired from the same surface variant, resulting in 45 sets of 36 distances each. The distance sets were pooled according to the levels of downsampling, segmentation, and simplification, respectively, to assess informally the dependence of the overall landmark precision on the respective levels of the factors. Similarly, for Landmark Dataset 4, pairwise distances were computed among the nine landmark sets acquired from the same surface variant.
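The distance counts quoted above follow from n(n − 1)/2 pairs among n landmark sets. A minimal sketch, assuming that the distance between two configurations is the Euclidean norm of the difference of all their coordinates; the function name `pairwise_config_distances` and the random coordinates are placeholders:

```python
import numpy as np
from itertools import combinations

def pairwise_config_distances(configs):
    """Euclidean distance between every pair of landmark configurations,
    each configuration being treated as one vector of all its coordinates."""
    return [float(np.linalg.norm(a - b)) for a, b in combinations(configs, 2)]

rng = np.random.default_rng(2)
sets_dataset1 = rng.normal(size=(18, 24, 3))   # 18 sets per observer and girdle half
sets_dataset2 = rng.normal(size=(9, 24, 3))    # 9 sets per surface variant

d1 = pairwise_config_distances(sets_dataset1)
d2 = pairwise_config_distances(sets_dataset2)
print(len(d1), len(d2))  # 153 36
```

Pooling then amounts to concatenating such distance lists for all subgroups that share the same level of a factor before drawing the boxplots.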

Landmark precision and observer differences

All landmark sets in Landmark Dataset 3 were acquired from the same surface and thus were superimposed without any GPA. This allowed for the direct analysis of the precision with which the landmarks could be placed. Therefore, the standard deviation of each landmark in Landmark Dataset 3 was calculated according to von Cramon‐Taubadel et al. (2007) for each observer and each day separately to measure the precision with which a given landmark could be placed. Separate Wilcoxon signed rank tests were performed for each observer to test for significant differences in the landmark standard deviations between days. The Euclidean distances of each landmark configuration in Landmark Dataset 3 from the mean configuration obtained by O1 (O1 trained all observers, therefore the mean configuration of O1 was set as reference) were calculated to assess the similarity of the shapes measured by the different observers to the reference. The consideration of the day and order, in which the landmark sets were acquired, allowed for assessing trends in potential systematic deviation from the reference by linear regression. Calculations and visualizations were done in R using the above‐mentioned packages and plotrix (Lemon, 2006).
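The comparison with the reference configuration and the trend assessment can be sketched as follows. This is an illustrative NumPy version with hypothetical function names (`distances_to_reference`, `trend`) and an artificial observer whose first landmark drifts steadily; it is not the code used in the study.

```python
import numpy as np

def distances_to_reference(configs, reference):
    """Euclidean distance of each landmark configuration to a reference configuration."""
    return np.array([np.linalg.norm(c - reference) for c in configs])

def trend(distances):
    """Least-squares line through distance versus order of acquisition."""
    order = np.arange(1, len(distances) + 1)
    slope, intercept = np.polyfit(order, distances, deg=1)
    return slope, intercept

# Artificial example: the reference is the origin configuration and the
# observer's first landmark moves 0.1 units further away with every repetition.
reference = np.zeros((24, 3))
configs = []
for k in range(15):
    c = reference.copy()
    c[0, 0] += 0.1 * (k + 1)
    configs.append(c)

dist = distances_to_reference(configs, reference)
slope, intercept = trend(dist)
```

A slope close to zero would correspond to a roughly constant deviation from the reference, whereas a clearly positive slope, as in this artificial case, would indicate a deviation that grows over the course of acquisition.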

Results

‘Reasonable’ surface variants and true biological variation

A visual comparison of the ‘reasonable’ surface variants (MidGreyT, OtsuT, SubThresh; original, subSimpl) of the ventral pectoral girdle halves of selected specimens (see Fig. 1 for ZMH A12601) revealed that surfaces of a given specimen rarely differed by more than two voxels; the highest deviations mainly occurred in areas where no landmarks had been placed. Subjectively optimal surface simplification (subSimpl) seemed to have an effect smaller than one voxel. This indicates that the simplification removed the voxel‐steps from the surfaces while maintaining the gross geometry and the anatomical details (e.g. no artefactual deformation of edges).

A plot of the first two principal components (56.56 and 10.24% of total variance, respectively) of the superimposed landmark sets acquired from the ‘reasonable’ surface variants (superimposed Landmark Dataset 1; Fig. 3A) showed that the sets from a given girdle half generally clustered together irrespective of surface variant and observer. Most girdle halves were separated along principal components 1 and 2. The specimens with overlapping regions in the first two principal components were separated along the third through fifth (6.64, 4.3, and 4.08%, respectively; latter not shown; Fig. 3B) principal components. Boxplots (Fig. 4) of pairwise Euclidean distances between the landmark sets acquired by a given observer from a given girdle half pooled by specimens showed considerable differences in the overall variance in the landmark data among specimens.

Figure 3.

Plots of principal components of the landmark sets acquired from the ‘reasonable’ surface variants (Landmark Dataset 1). Convex hulls encircle all landmark sets of a given girdle half acquired by the same observer. Specimens denoted by color, position by transparency of filling of the convex hull, segmentation by the type of the symbol, simplification by the filling of the symbol, and observer by the type of the line used for the convex hull. (A) Principal components 1 and 2. (B) Principal components 3 and 4.

Figure 4.

Boxplots of pairwise distances between landmark sets of a given girdle half acquired by the same observer pooled by specimens (used as informal measure of overall landmark precision by specimens).

A nested permanova of superimposed Landmark Dataset 1 revealed significant contributions of the factors species, specimen, position, segmentation, and observer to the total variance in the landmark data (Table 4), with specimen being the major factor, accounting for 47.92% of total variance. 93.25% of the total variance was caused by true biological variation. The major factor causing artefactual variance was intra‐observer error with a contribution of 3.1% to the total variance, followed by inter‐observer error, which contributed 2.86%. Segmentation accounted for 0.79% and the factor simplification was not significant.

Table 4.

Nested permanova of the landmark sets acquired from ‘reasonable’ surface variants of different specimens (superimposed Landmark Dataset 1)

| Factor | df | SS | MS | F | P | Variance component (%) |
| --- | --- | --- | --- | --- | --- | --- |
| Species | 1 | 746575982 | 746575981.54 | 10.26 | 0.0007*** | 31.37 |
| Specimen (nested in species) | 16 | 1164604487 | 72787780.41 | 7.71 | 0.0001*** | 47.92 |
| Position (nested in specimen) | 18 | 169950185 | 9441676.92 | 44.06 | 0.0001*** | 13.96 |
| Segmentation (nested in position) | 72 | 15427585 | 214272.02 | 5.30 | 0.0001*** | 0.79 |
| Simplification (nested in segmentation) | 108 | 4373945 | 40499.49 | 0.19 | 1 | 0 |
| Observer (nested in simplification) | 216 | 46331768 | 214498.93 | 3.77 | 0.0001*** | 2.86 |
| Residuals (repetitions nested in observer) | 864 | 49175843 | 56916.48 | | | 3.10 |
| Total | 1295 | 2196439793 | | | | |

All factors treated as random; permutations (if applicable, of means of next‐lower factor) performed for each factor separately and, if necessary, restricted within groups of next‐higher factor.

*** P ≤ 0.001.
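The F column of Table 4 can be checked by hand, because in a fully nested random-effects design each factor is tested against the mean square of the next-lower level. The short sketch below recomputes these ratios and the split into biological variation and measurement error from the tabulated values; the variable names are illustrative, and the ratios agree with the table only to within rounding (about 0.01).

```python
# Mean squares from Table 4, ordered from the highest factor to the residual.
levels = ["species", "specimen", "position", "segmentation",
          "simplification", "observer", "residual"]
mean_squares = [746575981.54, 72787780.41, 9441676.92, 214272.02,
                40499.49, 214498.93, 56916.48]

# Each factor is tested against the mean square of the level nested within it.
f_ratios = {
    upper: ms_upper / ms_lower
    for upper, ms_upper, ms_lower in zip(levels, mean_squares, mean_squares[1:])
}

# Variance components (%) from Table 4.
components = {"species": 31.37, "specimen": 47.92, "position": 13.96,
              "segmentation": 0.79, "simplification": 0.0,
              "observer": 2.86, "residual": 3.10}
biological = sum(components[k] for k in ("species", "specimen", "position"))
error = sum(components[k] for k in
            ("segmentation", "simplification", "observer", "residual"))
print(round(biological, 2), round(error, 2))  # 93.25 6.75
```

The residual term here is the intra-observer error and the observer term the inter-observer error, so the error share is the sum of the segmentation, simplification, observer, and residual components.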

Maximum range of surface variants

Figure 5A shows a plot of the first two principal components of the landmark sets that were acquired from the maximum range of surfaces of a selected girdle half (Landmark Dataset 2). Principal components 1 and 2 represented, respectively, 32.49 and 14.94% of the total variance in the landmark data. Landmark sets from surfaces generated with different segmentation strategies and different degrees of surface simplification were roughly separated along the first principal component, and the strongest downsampling degree (Down4) was roughly separated from the other two (NoDown, Down2) along the second and third (8.64% of total variance, not shown) principal components. There was no obvious pattern of clustering in plots of higher principal components (not shown), which indicates a rather random variation of landmark data along these components. However, not all groups of landmark sets of distinct surface variants were perfectly separated. This potentially indicated considerable similarity of the corresponding surfaces and was particularly true for subjectively optimal and strongly simplified (subSimpl, strongSimpl) surface variants of the strongly downsampled (Down4) volume.

Figure 5.

Principal component plot and Euclidean distances of landmark sets acquired from the maximum range of surface variants of the ventral pectoral girdle half of Bombina orientalis (ZMH A12601; Landmark Dataset 2). (A) Plot of first two principal components. Convex hulls encircle all landmark sets of a given surface variant. Downsampling denoted by color family, segmentation by the type of the symbol, simplification by transparency of filling of the convex hull, and day by the filling of the symbol. (B) Boxplots of pairwise Euclidean distances between the full landmark configurations of each surface variant. (C) Notched boxplots of pairwise Euclidean distances of (B) pooled according to the levels of, respectively, the factors downsampling, segmentation, and simplification. (Pooled) Euclidean distances were used to informally compare the variance among the levels of the factors.

The boxplots of the Euclidean distances derived from Landmark Dataset 2 (Fig. 5B,C) generally showed the highest variations (greatest pairwise distances) for the landmark sets acquired from surfaces of strongly downsampled (Down4) volumes compared with the other two degrees of downsampling (NoDown, Down2). Within the factor segmentation and irrespective of the other factors, the landmark sets acquired from surfaces generated with the segmentation strategies MidGreyT and OtsuT, on average, showed the least variance. With regard to surface simplification, landmark sets on subjectively optimal simplified (subSimpl) surfaces generally were the least variable. This was in agreement with the subjective impression of O1 during landmark acquisition: landmark placing on subjectively optimal simplified surfaces was experienced as being the easiest (this also applied to other scans/specimens).

The pattern of Euclidean distances within the landmark sets of the surface variants of B. orientalis specimen ZMH A05682 in Landmark Dataset 4 (‘reasonable’ surface variants of ZMH A05676 and A05682, each repeatedly landmarked three times in three sessions; Fig. 6) was similar to that of ZMH A12601 in Landmark Dataset 2 (Fig. 5B): segmentation based on automatic local thresholding (MidGreyT, OtsuT) and subjectively optimal surface simplification (subSimpl) were advantageous with regard to overall landmark precision. For ZMH A05676 (Fig. 6), landmark sets from surface variants derived from automatic local thresholding based segmentations showed similar (OtsuT) or higher (MidGreyT) variation than those sets acquired from the surfaces created by manual threshold selection (SubThresh). The variance among landmark sets acquired from subjectively optimal simplified (subSimpl) surfaces of ZMH A05676 generally was smaller than the variance among the respective original surfaces.

Figure 6.

Boxplots of pairwise Euclidean distances between the landmark sets acquired from each ‘reasonable’ surface variant of Bombina orientalis specimens ZMH A05682 (left) and ZMH A05676 (right).

The nested permanova of Landmark Dataset 2 revealed a significant contribution of the factors segmentation, simplification, and day to the total variance within the landmark data. Among those factors, segmentation (52.56%) was responsible for most variation (Table 5). The variance between days was smaller than the variance within the same day and also smaller than the added variance due to surface simplification. Downsampling did not contribute significantly to the total variance.

Table 5.

Nested permanova of the landmark sets acquired from the maximum range of surface variants of the ventral (superficial) girdle half of Bombina orientalis (ZMH A12601; Landmark Dataset 2)

| Factor | df | SS | MS | F | P | Variance component (%) |
| --- | --- | --- | --- | --- | --- | --- |
| Downsampling | 2 | 5745432 | 2872715.82 | 1.5201 | 0.1397 | 6.14 |
| Segmentation (nested in downsampling) | 12 | 22678045 | 1889837.10 | 9.0826 | 0.0001*** | 52.56 |
| Simplification (nested in segmentation) | 30 | 6242139 | 208071.28 | 4.5536 | 0.0001*** | 15.22 |
| Day (nested in simplification) | 90 | 4112486 | 45694.29 | 1.9432 | 0.0001*** | 6.24 |
| Residuals (repetitions nested in day) | 270 | 6349050 | 23515.00 | | | 19.84 |
| Total | 404 | 45127151 | | | | |

All factors treated as random; permutations (if applicable, of means of next‐lower factor) performed for each factor separately and, if necessary, restricted within groups of next‐higher factor.

*** P ≤ 0.001.

Observer error and landmark precision

A plot of the first two principal components (33.26 and 28.08% of total variance, respectively) of Landmark Dataset 3 (different observers on same surface, 20 repetitions on each of two days, first five repetitions of each day discarded) showed that the landmark sets of a given observer generally clustered together (Fig. 7A), with overlapping areas between O1 and O3, as well as between O5 and O6. The Euclidean distance of each landmark set to the mean configuration of O1 (set as reference) revealed that the deviations of O2, O3, O5, and O6 were more or less constant over time (Fig. 7B). The deviations of O4 initially fell in the same range as those of O2, O3, O5, and O6, but they increased with time.

Figure 7.

Principal component plot and Euclidean distances of landmark sets to the mean landmark configuration of O1 derived from Landmark Dataset 3 (six observers, one surface variant). (A) Principal components 1 and 2. Convex hulls encircle all landmark sets by a given observer acquired on the same day. Observer denoted by symbol type and color, and day by the type of the line used for the convex hull. (B) Euclidean distances of the full landmark sets to the mean landmark configuration of O1 (set as reference) by day and order of acquisition. Regression lines visualize trends in deviation from the reference.

A nested permanova of Landmark Dataset 3 revealed that the factors observer and day significantly contributed to the total variance (70.36 and 5.53%, respectively; Table 6). The variation within the repetitions of a given day contributed 24.1%.

Table 6.

Nested permanova of the landmark sets acquired by different observers from the same surface variant (MidGreyT, original) of ventral (superficial) girdle half of Bombina orientalis (ZMH A12601; Landmark Dataset 3)

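| Factor | df | SS | MS | F | P | Variance component (%) |
| --- | --- | --- | --- | --- | --- | --- |
| Observer | 5 | 8432348 | 1686469.66 | 20.71 | 0.0002*** | 70.36 |
| Day (nested in observer) | 6 | 488501 | 81416.87 | 4.44 | 0.0001*** | 5.53 |
| Residuals (repetitions nested in day) | 168 | 3078948 | 18327.07 | | | 24.10 |
| Total | 179 | 11999797 | | | | |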

All factors treated as random; permutations (if applicable, of means of next‐lower factor) performed for each factor separately and, if necessary, restricted within groups of next‐higher factor.

*** P ≤ 0.001.

The smallest standard deviation across observers was below 10 for all landmarks, with the exception of landmarks 4 and 20 (10.11 and 11.99, respectively; Fig. 8). Some landmarks (e.g. 2, 11, 24) were acquired with consistent high precision by all observers, whereas the standard deviations of others (e.g. 7, 18, 20) greatly differed among observers. Wilcoxon signed rank tests revealed no significant differences in the standard deviations of the different days for O1 and O2 (P‐values 0.2522 and 0.1974, respectively). For O3–O6, the standard deviations significantly differed among days (0.0031, 0.0164, 0.0007, and 0.0005, respectively); the standard deviations of the second day were generally smaller, but exceptions occurred.

Figure 8.

Landmark standard deviations calculated for each observer and each day separately (derived from Landmark Dataset 3).

Discussion

In μCT‐based 3D geometric morphometrics, data goes through several processing steps, each of which may add artefactual variance to the final landmark data. Researchers often follow commonly used procedures and protocols without full quantitative appreciation of measurement error that is potentially introduced with each of the processing steps. We intended to assess the artefactual variance that had been added during surface generation and landmark acquisition to see whether some of the steps are more critical than others, and to derive recommendations for measurement error reduction. We identified variance introduced by observer and segmentation as the main sources of measurement error. Training periods prior to landmark acquisition, landmark acquisition in as few sessions as possible, careful surface simplification, and the use of case‐specific optimal segmentation strategies can potentially help reduce measurement error.

Quality of automatic local thresholding

Our choice of automatic local thresholding algorithms and corresponding parameters was based on previous studies (Otsu algorithm; Landini et al. 2017; Healy et al. 2018) and on the performance of different algorithm–parameter combinations applied to a rather arbitrarily created stack of reconstructed phantom images (MidGrey algorithm). We did not evaluate the effects of image noise, pixel size or contrast of bone and soft tissue. These factors might have a significant effect on the thresholding quality. Furthermore, various other thresholding methods have been published (reviewed, e.g. by Sezgin & Sankur, 2004); these were not considered herein due to the limitation to the use of the Auto Local Threshold plugin. It is likely that the algorithms and parameters, as well as the strategy of thresholding resliced versions of the μCT stacks and combining them later on, may not have led to optimal results for all μCT scans. We believe, however, that our automatic local thresholding strategies yielded good results, as a visual control showed acceptable surfaces; our aim was to cover a range of surfaces to assess the effect of different segmentation strategies in CT‐based geometric morphometric studies, not to rate the performance of different automatic thresholding strategies. Still the questions remain, which automatic thresholding strategy results in a binarization closest to reality, and how the choice of the optimal thresholding strategy depends on, for example, the scan quality. Answering these questions will allow avoidance of measurement error caused by using unnaturally shaped surface geometries for landmark acquisition.

Observer error and surface simplification

Measurement error caused by artefactual variance between and within observers is a common phenomenon in geometric morphometric studies (e.g. Valeri et al. 1998; Barbeito‐Andrés et al. 2012). The data reported in the present study are no exception: inter‐ and intra‐observer errors were the major sources of artefactual variance in landmark data that represented a realistic geometric morphometric dataset (superimposed Landmark Dataset 1; Table 4). In the superimposed Landmark Dataset 1 the relative contributions of inter‐ and intra‐observer error were similar, the latter only slightly exceeding the former. This observation contradicts previous findings that inter‐observer error commonly exceeds intra‐observer error (e.g. Singleton, 2002; Wilson et al. 2011). The pattern observed in our data might be caused by the composition of Landmark Dataset 1: repetitions on a given surface were performed on different days and all other surfaces were landmarked in‐between repetitions. This might have prevented any effect of memorizing exact landmark positions from the previous repetition. In addition, the number of surfaces landmarked consecutively might have caused some kind of fatigue which, in turn, might have led to inattentive landmark placing, causing higher intra‐observer error. Further, inter‐observer error in Landmark Dataset 1 might have been exceptionally small, as both observers extensively discussed landmark positions prior to landmark acquisition. The pattern observed in Landmark Dataset 3, acquired by six partly inexperienced observers, agrees well with other reports, as inter‐observer error clearly exceeded intra‐observer error (Table 6). The observed relatively large Euclidean distances of O2 to the mean landmark configuration of O1 (reference) and the increasing distances of the landmark sets acquired by O4 (Fig. 7) indicate systematic deviations. These suggest that there were differences in how observers identified landmarks (i.e. they did not identify exact homologous points as the spots where to place the landmarks) and that they placed the landmarks at different points. The particular pattern observed for O4 might also be attributed to O4 having a ‘bad day’, leading to inconsistent landmark placing. Considering, however, that the deviations of O4 seem to have a direction, having a ‘bad day’ seems to be an unlikely explanation for the observed pattern. Inconsistency in landmark identification even among experienced observers has also been reported by Shearer et al. (2017) and seems to be a common phenomenon. This should caution researchers to make sure that landmarks are precisely defined and that all observers place landmarks at exactly homologous points; in‐person training in landmark collection can help to minimize inter‐observer error (Shearer et al. 2017). The observers should be trained to have a thorough knowledge of the landmark acquisition techniques, as well as of the biological landmarks and the variability in their expressions (Corner et al. 1992). Taking into account that there are deviations even between experienced observers, it seems advisable to test repeatedly for inter‐observer inconsistencies in landmark identification over time to prevent observers from systematically deviating from the defined landmarks by developing their own landmark definitions.

Despite the considerable amount of inter‐observer error observed in previous works and herein, previous accounts (e.g. Singleton, 2002; Chang & Alfaro, 2016) and the fact that all specimens in our superimposed Landmark Dataset 1 were well‐separated along the first few principal components show that landmark data acquired by different observers can give results that are precise enough to allow correct biological inferences. If measurement error is crucial, that is, if the biological variation of interest is small relative to error, as for example in the analysis of asymmetry (Klingenberg, 2015; Robinson & Terhune, 2017; Shearer et al. 2017), it seems advisable to exclude inter‐observer error by having only one observer perform landmark acquisition. Other strategies to cope with inter‐observer error have been suggested in the literature (see Fruciano, 2016 and Fruciano et al. 2017 for a more detailed discussion).

Intra‐observer error contributed considerably to measurement error in parts of the present study. A common suggestion in the literature is to reduce intra‐observer error by performing repeated measurements of landmarks and to use them or their averages for analyses (Corner et al. 1992; Arnqvist & Mårtensson, 1998; also see review by Fruciano, 2016). In our study, the considerable decrease in landmark standard deviations on the second day exclusively in inexperienced observers (Fig. 8) implies an increase in the precision of landmark placing with experience. A similar learning effect has been reported by Valeri et al. (1998). As with inter‐observer error reduction, including a training period seems to be an effective way of decreasing intra‐observer error, too (also see Chang & Alfaro, 2016).

Our data suggest that explicitly using surfaces that allow for a high precision in landmark placing is an additional way to reduce intra‐observer error. The relatively small Euclidean distances between repeated landmark sets of a given subjectively optimal simplified (subSimpl) surface variant imply that a slight surface simplification generally increases the precision with which landmark coordinates can be acquired (Figs 5B,C and 6), whereas the alteration of the surface geometry seems negligible (Fig. 1, Table 4). With stronger simplification, however, negative effects set in and increase the artefactual variance (Fig. 5B,C, Table 5) by altering more and more the surface geometry and by decreasing the landmark placing precision in most cases. Generally, we recommend the use of surface simplification, but this has to be applied with caution; its effect in a given study should be assessed appropriately. Our data suggest that another way of obtaining surfaces that allow for a high precision in landmark acquisition might be the application of wisely chosen segmentation strategies; segmentation based on automatic local thresholding using case‐specific optimal combinations of thresholding algorithms and parameters seems to outperform manual thresholding (Figs 5B,C and 6, and see below).

Error within and between days

In Landmark Datasets 2 and 3, the artefactual variance within the repetitions on the same day exceeded the variance between days (Tables 5 and 6). This observation is counterintuitive in that, when repeating landmark acquisition on the same day and on different days, one would expect that, due to short‐term memory effects, the measurement error between days would be greater than within repetitions on the same day. All landmark sets in Landmark Dataset 2 were acquired from different surface variants of the same specimen and O1 knew the anatomical peculiarities of that specimen very well. As a consequence of this, it seems likely that O1 remembered the exact landmark positions the next day, which might have led to unusually small variance between days. The high number of consecutive repetitions on the same day might have led to inattentive landmark placing which caused higher measurement error within days. The observed pattern of observer‐dependent variance in Landmark Dataset 3 might have similar causes. Additionally, it seems likely that, for the inexperienced observers, it took several repetitions to find their own way of identifying landmarks; this would have increased the observer error on the first day.

Error caused by segmentation

Segmentation based on automatic local thresholding algorithms (MidGreyT, OtsuT) outperformed manual thresholding in the two B. orientalis specimens ZMH A05682 and ZMH A12601 (ventral girdle halves only; Figs 5B,C and 6), for which the combination of thresholding algorithm and parameters was likely close to optimal. For these two specimens, automatic local thresholding had two advantages: first, the derived surfaces most likely had a geometry closer to the real bones than did the surfaces of other segmentation strategies and, second, the generated surfaces allowed landmarks to be placed with higher precision and thereby helped to reduce intra‐observer error. These positive effects, however, did not apply to all specimens. Among the surfaces of ZMH A05676 (ventral girdle half only), the surface derived by manual threshold selection (SubThresh) allowed for equal or higher precision in landmark placing and, thus, outperformed some of the automatic‐local‐thresholding‐derived surfaces. One explanation might be that the quality (CNR of bone and soft tissue) of the μCT scan of ZMH A05676 was quite different from that of the scans of ZMH A05682 and ZMH A12601 and that the parameters used during automatic local thresholding were less suitable for the scan of ZMH A05676. This might have resulted in a specific surface geometry less suitable for precise landmark placing. In other words, the application of the case‐specific optimal combination of automatic local thresholding algorithm and parameters can outperform other segmentation strategies with regard to measurement error, whereas non‐optimal combinations can increase measurement error. These conclusions are based on only three specimens and need to be verified in future studies. Yet, segmentation had a significant effect on measurement error in our data. Our findings corroborate previous studies (Williams & Richtsmeier, 2003; Gunz et al. 2012) that have reported thresholding to be a critical step when deriving measurements from CT‐based surfaces. This evidence suggests that special attention should be paid to the selection of the thresholding strategy in geometric morphometric studies. Ideally, the effects of using different segmentation strategies should be assessed; yet, considering that only 0.79% of variance in Landmark Dataset 1 was caused by the factor segmentation, a formal comparison of different thresholding strategies might not be mandatory. If automatic thresholding algorithms are applied, the use of case‐specific optimal algorithm–parameter combinations should be ensured to prevent negative effects (i.e. increased measurement error).
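To make the thresholding step more concrete, the following minimal Python sketch implements the global criterion of Otsu (1979) on a list of integer grey values. It is an illustration only: the function name `otsu_threshold` and the toy grey values are hypothetical, and the local variants discussed above (MidGreyT, OtsuT) presumably evaluate their criteria within a neighbourhood of each voxel rather than once for the whole volume, which this sketch does not attempt.

```python
def otsu_threshold(grey_values, n_levels=256):
    """Return the grey level t that maximizes the between-class variance.

    Values <= t form the background class, values > t the foreground class.
    grey_values must be integers in the range 0 .. n_levels - 1.
    """
    hist = [0] * n_levels
    for g in grey_values:
        hist[g] += 1

    total = len(grey_values)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0    # number of values in the background class
    sum_bg = 0  # sum of grey values in the background class
    for t in range(n_levels):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue  # no background values yet
        w_fg = total - w_bg
        if w_fg == 0:
            break  # all values are background; nothing left to separate
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        # Between-class variance up to a constant factor (1 / total**2).
        between_var = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t

# Toy grey values (assumed): three 'soft tissue' and three 'bone' voxels.
toy_values = [10, 11, 12, 198, 200, 202]
t = otsu_threshold(toy_values)
bone_mask = [g > t for g in toy_values]
print(t, bone_mask)
```

Because the criterion depends only on the grey‐value histogram, a scan with a low contrast between bone and soft tissue yields a less clear‐cut maximum, which is one way to picture why a given algorithm–parameter combination may suit one scan but not another.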

Effect of voxel size

Our statistical analyses suggest that the resolution (i.e. the voxel size) of the underlying μCT‐derived volume does not significantly contribute to measurement error in landmark data (Table 5). The patterns in Fig. 5, however, indicate a considerable impact of resolution on shape measured by landmarks for surfaces derived from strongly downsampled (Down4) volumes. Downsampling had two effects. First, the more strongly a volume was downsampled, the more the shape measured from the derived surface deviated from the shape of the corresponding surface generated from the full‐resolution volume (NoDown; Fig. 5A). Second, landmark sets acquired from surfaces of strongly downsampled volumes were considerably more variable (higher intra‐observer error) than those of less downsampled volumes (Fig. 5B,C). The non‐significance of the factor downsampling in the permanova of Landmark Dataset 3 (Table 5) might have been caused by the small number of repetitions; more surface variants of additional downsampling degrees or the addition of differently downsampled scans of other specimens might have resulted in significance. One potential reason for the peculiar pattern observed in Fig. 5A is the occurrence of non‐random variance due to downsampling. Such non‐random variance might have been too small to reach significance in the permanova but might have caused the rough separation of landmark sets according to the degree of downsampling along principal component 2.

Christiansen (2016) has reported a strong effect of voxel size on measures of trabecular bones if structures were too thin relative to voxel size. Simon & Marroig (2015), however, found no considerable difference in the precision of landmarks from volumes with a range of different voxel sizes. The evidence from previous studies and the comparably small differences between landmark sets from surfaces of non‐downsampled (NoDown) and corresponding slightly downsampled (Down2; Fig. 5A) volumes suggest that the voxel size has a minor effect on measurement error, given that it is small enough relative to the structures examined. Conversely, the use of volume data with a coarse spatial resolution in relation to structure size may cause considerable artefactual variance. Thus, shape analysis of specimens of similar sizes scanned with moderately different voxel sizes might be unproblematic as long as the voxel sizes are small enough. Although the generality of these conclusions remains to be proven, it still seems advisable to use volume data with a good spatial resolution to avoid any potential increase in measurement error. On the other hand, operating a CT scanner at the highest possible spatial resolution considerably increases scanning and image processing times and can cause lower signal‐to‐noise ratios in the X‐ray projections that are captured during a scan. Therefore, finding a reasonable case‐specific resolution is recommended.

With regard to our methodological approach for simulating different scan resolutions, it should be noted that the downsampling (Down2, Down4) might have decreased image noise in the volume data. As a consequence, the respective segmentation results and thus the derived surfaces might have been of a better quality than those of the non‐downsampled (NoDown) data. The variance components for the factors segmentation, simplification, and day, as well as for the repetitions within one day (residuals), might slightly underestimate the true error (Table 5). The decisions on the significance of the respective factors should not be affected by this, as all those factors are already highly significant, even with the potentially underestimated artefactual variance.
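The noise‐reducing side effect of downsampling can be illustrated with a small simulation. The sketch below assumes that downsampling is done by averaging non‐overlapping blocks of voxels and that the image noise is independent between voxels; both are assumptions made for illustration, and the function names, the volume size, and the noise level are hypothetical. Under these assumptions, averaging blocks of factor³ voxels is expected to lower the noise standard deviation by roughly the square root of the number of voxels per block; real CT noise is spatially correlated, so the reduction would be smaller in practice.

```python
import random
from statistics import mean, pstdev

def downsample(volume, factor):
    """Block-average a 3D volume given as nested lists indexed [z][y][x].

    Voxels that do not fill a complete block at the upper borders are dropped.
    """
    nz, ny, nx = len(volume), len(volume[0]), len(volume[0][0])
    out = []
    for z in range(0, nz - nz % factor, factor):
        plane = []
        for y in range(0, ny - ny % factor, factor):
            row = []
            for x in range(0, nx - nx % factor, factor):
                block = [
                    volume[z + dz][y + dy][x + dx]
                    for dz in range(factor)
                    for dy in range(factor)
                    for dx in range(factor)
                ]
                row.append(mean(block))
            plane.append(row)
        out.append(plane)
    return out

def flatten(volume):
    """Return all voxel values of a nested-list volume as one flat list."""
    return [v for plane in volume for row in plane for v in row]

# Homogeneous synthetic 'material' (grey value 100) with Gaussian noise (SD 10).
random.seed(1)
size = 16
noisy = [[[random.gauss(100.0, 10.0) for _ in range(size)]
          for _ in range(size)] for _ in range(size)]

noise_sd = {1: pstdev(flatten(noisy))}
for factor in (2, 4):
    noise_sd[factor] = pstdev(flatten(downsample(noisy, factor)))

for factor, sd in noise_sd.items():
    print(f"downsampling factor {factor}: grey-value SD = {sd:.2f}")
```

A lower grey‐value scatter within a homogeneous material makes bone and soft tissue easier to separate by thresholding, which is the mechanism by which segmentation results of downsampled volumes might have been of better quality.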

Landmark precision

The pattern of landmark standard deviations observed in Landmark Dataset 3 (Fig. 8) shows that all landmarks can be placed with high precision, which indicates that the definition of landmarks itself is sufficiently precise. The higher standard deviations observed for some landmarks might indicate a personal component in landmark identification and placement precision; that is, some observers have difficulties in placing certain landmarks, whereas others are able to place these landmarks with high precision. For dealing with such error‐prone landmarks, von Cramon‐Taubadel et al. (2007) suggested either excluding the landmarks concerned from the dataset or redefining them more accurately. It is remarkable, however, that the highest standard deviations in our data occurred on the first day and were produced by inexperienced observers. Thus, the imprecision in landmark placing in Landmark Dataset 3 might mainly be the result of lack of experience.

The overall precision of landmarks varied considerably among specimens (Fig. 4). The lack of a clear pattern of landmark precision correlating with scan parameters or bone to soft tissue CNR (Table 1), and the lack of major differences in the variation among surface variants of a given specimen, suggest that specimen morphology has caused the differences in landmark precisions. This implies different levels of measurement error among individual specimens. Therefore, it seems advisable to use more than two or three specimens when assessing measurement error on a subsample of the actual sample.
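As an illustration of how a per‐landmark precision measure of this kind can be computed, the following sketch takes repeated placements of one landmark on the same surface and returns the scatter around their mean position. The exact definition is an assumption made here for illustration (square root of the summed squared distances to the mean position, divided by n − 1) and may differ from the one used for the figures cited above; the function name and coordinates are hypothetical. The repetitions are assumed to share one coordinate system, so no superimposition is involved.

```python
from math import sqrt

def landmark_sd(placements):
    """Scatter of repeated placements of a single landmark.

    placements: list of (x, y, z) tuples from repeated digitizing of the
    same landmark on the same surface (common coordinate system assumed).
    """
    n = len(placements)
    if n < 2:
        raise ValueError("need at least two repetitions")

    # Mean position of all repetitions.
    centroid = [sum(p[i] for p in placements) / n for i in range(3)]

    # Squared Euclidean distance of each repetition to the mean position.
    sq_dist = [
        sum((p[i] - centroid[i]) ** 2 for i in range(3)) for p in placements
    ]
    return sqrt(sum(sq_dist) / (n - 1))

# Toy coordinates (assumed, arbitrary units): three placements of one landmark.
repetitions = [(1.00, 2.00, 3.00), (1.02, 1.98, 3.01), (0.99, 2.01, 2.98)]
print(round(landmark_sd(repetitions), 4))
```

Computing this value separately per observer, day, and specimen allows the kind of comparison made in this section, for example checking whether high values are confined to the first day of inexperienced observers.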

Conclusion

In this study, most artefactual variance in superimposed Landmark Dataset 1 (acquired from ‘reasonable’ surface variants) was caused by intra‐observer error. The observed error between the days in Landmark Dataset 2 and the learning effect observed in Landmark Dataset 3 suggest that landmark data should be acquired by an experienced observer in as few sessions as possible; in this context, however, the effects of landmarking a high number of surfaces in the same session should be considered, as this might cause higher error due to fatigue. Intra‐observer error can also be minimized by using surfaces that allow for a high precision in landmark placing. Results from Landmark Datasets 2 and 4 suggest that such surfaces can be obtained by careful surface simplification as long as the resolutions of the underlying volumes are good enough, and possibly also by applying automatic local thresholding with an optimal algorithm–parameter combination.

The second largest amount of artefactual variance in superimposed Landmark Dataset 1 was due to inter‐observer error. Thus, it would be preferable for all landmark sets to be acquired by a single experienced observer to avoid this type of measurement error. If landmark sets need to be acquired by multiple observers, it is important to ensure that all observers are well trained and that they place the landmarks consistently at homologous points.

In the superimposed Landmark Dataset 1, the contribution of the segmentation strategy for surface generation to the total variance was significant yet small compared with other factors. Still, the strong effect of the factor segmentation in Landmark Dataset 2 implies that the segmentation strategy should be carefully chosen to obtain a surface that represents the natural shape of the specimen with the best morphological fidelity. Although downsampling had no significant effect in this study, there are indications that using a reasonably high spatial resolution for CT scanning is advisable for best morphological fidelity and highest landmark precision.

Despite the significant amount of measurement error in landmark data acquired from our ‘reasonable’ surface variants, the artefactual variation was still small relative to true biological shape differences. The observed 6.75% of artefactual variance probably could have been reduced by following the recommendations above. In our opinion, this small amount of measurement error and the potential for its further reduction justify the use of μCT‐derived surfaces for 3D landmark data acquisition of anuran pectoral girdles for shape analysis between specimens and of variation between body halves within a given specimen.

Our experimental design revealed several options to reduce measurement error in the analysis of the anuran pectoral girdle shape by means of μCT‐based geometric morphometrics. These options may well apply to other biological objects; however, we still follow previous recommendations (e.g. Klingenberg, 2015; Fruciano, 2016; Robinson & Terhune, 2017) to assess systematically the effects of all factors potentially contributing to measurement error in landmark data and to compare the magnitude of artefactual variance to the biological variation of interest.

Disclosure of interests

The authors declare no conflict of interests.

Author contributions

KE, JH, and AH designed the study. AB, SB, JUH, KE, and TK were involved in μCT scanning; SG provided infrastructure and additional expertise for μCT. Landmark sets were acquired by KE, JH, JUH, and, not appearing as authors, by Juliana Lutz, Lena Schwinger, and Mehria Sedik. KE generated the surfaces, performed the analyses, and drafted the manuscript. AH and JH commented on early versions of the manuscript. All authors critically revised the manuscript and approved the final version.

Supporting information

Fig. S1. Bones of Bombina orientalis (ZMH A12601). Surface render in anterolateral view, anterior to the left.

Fig. S2. Steps to determine the optimal segmentation strategy for the μCT scan of a selected Bombina orientalis specimen (ZMH A12601) using automatic local thresholding.

Table S1. Thresholding quality assessed for bone volumes and the adjacent two rows of voxels (first quality measure) of automatic local thresholding trials. The best and second best algorithm–parameter combinations were performed on the reconstructed phantom stack and its resliced derivatives.

Table S2. Landmark definitions and visualizations.

Table S3. Overview of the composition of Landmark Datasets 1–4.

Text S1. Explanations on workflow to generate a synthetic CT volume (‘phantom stack’) and to determine the case‐specific best combination of automatic local thresholding algorithm and parameters.

Acknowledgements

We sincerely thank Juliana Lutz, Lena Schwinger, and Mehria Sedik for participating in landmark data acquisition, and Frank Friedrich for fruitful discussions. We are grateful to for granting access to their respective μCT systems. This work was funded by the Deutsche Forschungsgemeinschaft (HA 2323/14‐1) and the Wilhelm‐Peters‐Fonds of the Deutsche Gesellschaft für Herpetologie und Terrarienkunde e.V.

References

  1. Adams DC, Rohlf FJ, Slice DE (2004) Geometric morphometrics: ten years of progress following the ‘revolution’. Ital J Zool 71, 5–16.
  2. Adams DC, Rohlf FJ, Slice DE (2013) A field comes of age: geometric morphometrics in the 21st century. Hystrix 24, 7–14.
  3. Adams DC, Collyer ML, Kaliontzopoulou A, et al. (2017) Geomorph: software for geometric morphometric analyses. Available from: https://cran.r-project.org/package=geomorph.
  4. Anderson MJ (2001) A new method for non‐parametric multivariate analysis of variance. Austral Ecol 26, 32–46.
  5. Anderson MJ, ter Braak CJF (2003) Permutation tests for multi‐factorial analysis of variance. J Stat Comput Sim 73, 85–113.
  6. Arnqvist G, Mårtensson T (1998) Measurement error in geometric morphometrics: empirical strategies to assess and reduce its impact on measures of shape. Acta Zool Acad Sci Hung 44, 73–96.
  7. Barbeito‐Andrés J, Anzelmo M, Ventrice F, et al. (2012) Measurement error of 3D cranial landmarks of an ontogenetic sample using computed tomography. J Oral Biol Craniofac Res 2, 77–82.
  8. Barbeito‐Andrés J, Bernal V, Gonzalez PN (2016) Morphological asymmetries of mouse brain assessed by geometric morphometric analysis of MRI data. Magn Reson Imaging 34, 980–989.
  9. Bilfeld MF, Dedouit F, Sans N, et al. (2013) Ontogeny of size and shape sexual dimorphism in the ilium: a multislice computed tomography study by geometric morphometry. J Forensic Sci 58, 303–310.
  10. Bonneau N, Bouhallier J, Simonis C, et al. (2012) Technical note: shape variability induced by reassembly of human pelvic bones. Am J Phys Anthropol 148, 139–147.
  11. Bookstein FL (1997) Landmark methods for forms without landmarks: morphometrics of group differences in outline shape. Med Image Anal 1, 225–243.
  12. Bookstein FL (2003) Morphometric Tools for Landmark Data: Geometry and Biology. Cambridge: Cambridge University Press.
  13. Buser TJ, Sidlauskas BL, Summers AP (2018) 2D or not 2D? Testing the utility of 2D vs. 3D landmark data in geometric morphometrics of the sculpin subfamily Oligocottinae (Pisces; Cottoidea). Anat Rec 301, 806–818.
  14. Cardini A (2014) Missing the third dimension in geometric morphometrics: how to assess if 2D images really are a good proxy for 3D structures? Hystrix 25, 73–81.
  15. Chang J, Alfaro ME (2016) Crowdsourced geometric morphometrics enable rapid large‐scale collection and analysis of phenotypic data. Methods Ecol Evol 7, 472–482.
  16. Christiansen BA (2016) Effect of micro‐computed tomography voxel size and segmentation method on trabecular bone microstructure measures in mice. Bone Rep 5, 136–140.
  17. Cignoni P, Callieri M, Corsini M, et al. (2008) MeshLab: an open‐source mesh processing tool. Sixth Eurographics Italian Chapter Conference, 129–136.
  18. Coleman MN, Colbert MW (2007) Technical note: CT thresholding protocols for taking measurements on three‐dimensional models. Am J Phys Anthropol 133, 723–725.
  19. Collard M, O'Higgins P (2001) Ontogeny and homoplasy in the papionin monkey face. Evol Dev 3, 322–331.
  20. Colman KL, Dobbe JGG, Stull KE, et al. (2017) The geometrical precision of virtual bone models derived from clinical computed tomography data for forensic anthropology. Int J Legal Med 131, 1155–1163.
  21. Corner BD, Lele S, Richtsmeier JT (1992) Measuring precision of three‐dimensional landmark data. J Quant Anthropol 3, 347–359.
  22. Cox PG, Fagan MJ, Rayfield EJ, et al. (2011) Finite element modelling of squirrel, guinea pig and rat skulls: using geometric morphometrics to assess sensitivity. J Anat 219, 696–709.
  23. von Cramon‐Taubadel N, Frazier BC, Lahr MM (2007) The problem of assessing landmark error in geometric morphometrics: theory, methods, and modifications. Am J Phys Anthropol 134, 24–35.
  24. Curth S, Fischer MS, Kupczik K (2017) Can skull form predict the shape of the temporomandibular joint? A study using geometric morphometrics on the skulls of wolves and domestic dogs. Ann Anat 214, 53–62.
  25. Daboul A, Ivanovska T, Bülow R, et al. (2018) Procrustes‐based geometric morphometrics on MRI images: an example of inter‐operator bias in 3D landmarks and its impact on big datasets. PLoS ONE 13, e0197675.
  26. Dryden IL (2017) Shapes: statistical shape analysis. Available from: https://CRAN.R-project.org/package=shapes.
  27. Dryden IL, Mardia KV (2016) Statistical Shape Analysis with Applications in R. Wiley Series in Probability and Statistics. Chichester: John Wiley & Sons.
  28. Engelkes K, Friedrich F, Hammel JU, et al. (2018) A simple setup for episcopic microtomy and a digital image processing workflow to acquire high‐quality volume data and 3D surface models of small vertebrates. Zoomorphology 137, 213–228.
  29. Fajardo RJ, Ryan T, Kappelman J (2002) Assessing the accuracy of high‐resolution x‐ray computed tomography of primate trabecular bone by comparisons with histological sections. Am J Phys Anthropol 118, 1–10.
  30. Fruciano C (2016) Measurement error in geometric morphometrics. Dev Genes Evol 226, 139–158.
  31. Fruciano C, Celik MA, Butler K, et al. (2017) Sharing is caring? Measurement error and the issues arising from combining 3D morphometric datasets. Ecol Evol 7, 7034–7046.
  32. Gunz P, Mitteroecker P, Bookstein FL (2005) Semilandmarks in three dimensions. In: Modern Morphometrics in Physical Anthropology. Developments in Primatology (ed. Slice DE), pp. 73–98. New York: Kluwer Academic/Plenum Publishers.
  33. Gunz P, Ramsier M, Kuhrig M, et al. (2012) The mammalian bony labyrinth reconsidered, introducing a comprehensive geometric morphometric approach. J Anat 220, 529–543.
  34. Hale AR, Honeycutt KK, Ross AH (2014) A geometric morphometric validation study of computed tomography – extracted craniofacial landmarks. J Craniofac Surg 25, 231–237.
  35. Healy S, McMahon J, Owens P, et al. (2018) Threshold‐based segmentation of fluorescent and chromogenic images of microglia, astrocytes and oligodendrocytes in FIJI. J Neurosci Meth 295, 87–103.
  36. Heuzé Y, Kawasaki K, Schwarz T, et al. (2016) Developmental and evolutionary significance of the zygomatic bone. Anat Rec 299, 1616–1630.
  37. Hildebolt CF, Vannier MW, Knapp RH (1990) Validation study of skull three‐dimensional computerized tomography measurements. Am J Phys Anthropol 82, 283–294.
  38. Kassambara A (2017) Ggpubr: ‘ggplot2’ based publication ready plots. Available from: https://CRAN.R-project.org/package=ggpubr.
  39. Kesterke MJ, Judd MA, Mooney MP, et al. (2018) Maternal environment and craniofacial growth: geometric morphometric analysis of mandibular shape changes with in utero thyroxine overexposure in mice. J Anat 233, 46–54.
  40. Klingenberg C (2015) Analyzing fluctuating asymmetry with geometric morphometrics: concepts, methods, and applications. Symmetry 7, 843–934.
  41. Klingenberg CP, Barluenga M, Meyer A (2002) Shape analysis of symmetric structures: quantifying variation among individuals and asymmetry. Evolution 56, 1909–1920.
  42. Kulemeyer C, Asbahr K, Gunz P, et al. (2009) Functional morphology and integration of corvid skulls – a 3D geometric morphometric approach. Front Zool 6, 2.
  43. Landini G, Randell DA, Fouad S, et al. (2017) Automatic thresholding from the gradients of region boundaries. J Microsc 265, 185–195.
  44. Lee JC (1982) Accuracy and precision in anuran morphometrics: artifacts of preservation. Syst Zool 31, 266–281.
  45. Lemon J (2006) Plotrix: a package in the red light district of R. R‐News 6, 8–12.
  46. Lockwood CA, Lynch JM, Kimbel WH (2002) Quantifying temporal bone morphology of great apes and humans: an approach using geometric morphometrics. J Anat 201, 447–464.
  47. Lorkiewicz‐Muszyńska D, Kociemba W, Sroka A, et al. (2015) Accuracy of the anthropometric measurements of skeletonized skulls with corresponding measurements of their 3D reconstructions obtained by CT scanning. Anthropol Anz 72, 293–301.
  48. Maglia AM, Púgener LA (1998) Skeletal development and adult osteology of Bombina orientalis (Anura: Bombinatoridae). Herpetologica 54, 344–363.
  49. Marcy AE, Fruciano C, Phillips MJ, et al. (2018) Low resolution scans can provide a sufficiently accurate, cost‐ and time‐effective alternative to high resolution scans for 3D shape analyses. PeerJ 6, e5032.
  50. McGill R, Tukey JW, Larsen WA (1978) Variations of box plots. Am Stat 32, 12–16.
  51. Mitteroecker P, Gunz P, Windhager S, et al. (2013) A brief review of shape, form, and allometry in geometric morphometrics, with applications to human facial morphology. Hystrix 24, 59–66.
  52. Monteiro L (2000) Geometric morphometrics and the development of complex structures: ontogenetic changes in scapular shape of dasypodid armadillos. Hystrix 11, 91–98.
  53. O'Higgins P, Jones N (1998) Facial growth in Cercocebus torquatus: an application of three‐dimensional geometric morphometric techniques to the study of morphological variation. J Anat 193, 251–272.
  54. Oksanen J, Blanchet FG, Friendly M, et al. (2017) Vegan: community ecology package. Available from: https://CRAN.R-project.org/package=vegan.
  55. Otsu N (1979) A threshold selection method from gray‐level histograms. IEEE Trans Syst Man Cybern 9, 62–66.
  56. Parkinson IH, Badiei A, Fazzalari NL (2008) Variation in segmentation of bone from micro‐CT imaging: implications for quantitative morphometric analysis. Australas Phys Eng Sci Med 31, 160–164.
  57. Perez SI, Bernal V, Gonzalez PN (2006) Differences between sliding semi‐landmark methods in geometric morphometrics, with an application to human craniofacial and dental variation. J Anat 208, 769–784.
  58. Plate T, Heiberger R (2016) Abind: combine multidimensional arrays. Available from: https://CRAN.R-project.org/package=abind.
  59. Pujol A, Rissech C, Ventura J, et al. (2014) Ontogeny of the female femur: geometric morphometric analysis applied on current living individuals of a Spanish population. J Anat 225, 346–357.
  60. R Core Team (2017) R: A Language and Environment for Statistical Computing. Vienna: R Foundation for Statistical Computing; Available from: https://www.R-project.org/.
  61. Rabinovich SG (2006) Measurement Errors and Uncertainties: Theory and Practice. New York: Springer Science and Media Inc.
  62. Richtsmeier JT, Paik CH, Elfert PC, et al. (1995) Precision, repeatability, and validation of the localization of cranial landmarks using computed tomography scans. Cleft Palate‐Cran J 32, 217–227.
  63. Robinson C, Terhune CE (2017) Error in geometric morphometric data collection: combining data from multiple sources. Am J Phys Anthropol 164, 62–75.
  64. Ross AH, Williams S (2008) Testing repeatability and error of coordinate landmark data acquired from crania. J Forensic Sci 53, 782–785.
  65. RStudio Team (2017) RStudio: Integrated Development for R. Boston: RStudio, Inc; Available from: http://www.rstudio.com/.
  66. Schindelin J, Arganda‐Carreras I, Frise E, et al. (2012) Fiji: an open‐source platform for biological‐image analysis. Nat Methods 9, 676–682.
  67. Schneider CA, Rasband WS, Eliceiri KW (2012) NIH image to ImageJ: 25 years of image analysis. Nat Methods 9, 671–675.
  68. Sezgin M, Sankur B (2004) Survey over image thresholding techniques and quantitative performance evaluation. J Electron Imaging 13, 146–165.
  69. Shearer BM, Cooke SB, Halenar LB, et al. (2017) Evaluating causes of error in landmark‐based data collection using scanners. PLoS ONE 12, e0187452.
  70. Simon MN, Marroig G (2015) Landmark precision and reliability and accuracy of linear distances estimated by using 3D computed micro‐tomography and the open‐source TINA Manual Landmarking Tool software. Front Zool 12, 12.
  71. Singleton M (2002) Patterns of cranial shape variation in the Papionini (Primates: Cercopithecinae). J Hum Evol 42, 547–578.
  72. Sokal RR, Rohlf FJ (1981) Biometry: The Principles and Practice of Statistics in Biological Research. San Francisco: Freeman.
  73. Valeri CJ, Cole TM, Lele S, et al. (1998) Capturing data from three‐dimensional surfaces using fuzzy landmarks. Am J Phys Anthropol 107, 113–124.
  74. Verhaegen Y, Adriaens D, Wolf TD, et al. (2007) Deformities in larval gilthead sea bream (Sparus aurata): a qualitative and quantitative analysis using geometric morphometrics. Aquaculture 268, 156–168.
  75. Wang C, Hsu H, Wang C, et al. (2015) Quantifying floral shape variation in 3D using microcomputed tomography: a case study of a hybrid line between actinomorphic and zygomorphic flowers. Front Plant Sci 6, 724.
  76. Wickham H (2016) Ggplot2: Elegant Graphics for Data Analysis. New York: Springer‐Verlag; Available from: http://ggplot2.org.
  77. Wiley DF, Amenta N, Alcantara DA, et al. (2005) Evolutionary morphing. IEEE Visualization.
  78. Williams FL, Richtsmeier JT (2003) Comparison of mandibular landmarks from computed tomography and 3D digitizer data. Clin Anat 16, 494–500.
  79. Wilson LAB, Cardoso HFV, Humphrey LT (2011) On the reliability of a geometric morphometric approach to sex determination: a blind test of six criteria of the juvenile ilium. Forensic Sci Int 206, 35–42.
  80. Zelditch M, Swiderski DL, Sheets HD (2012) Geometric Morphometrics for Biologists: A Primer. Amsterdam: Elsevier Academic Press.


Articles from Journal of Anatomy are provided here courtesy of Anatomical Society of Great Britain and Ireland
