Abstract
Human recognition and automated image validation are the most widely used approaches to validate the output of binary segmentation methods but, as the number of pixels in an image easily exceeds several million, they become highly demanding from both practical and computational standpoints. We propose a method, called PARSEG, which stands for PArtitioning, Random Selection, Estimation, and Generalization, the basic steps of the procedure. The proposed method enables us to perform statistical validation of binary images by selecting the minimum number of pixels from the original image to be used for validation, without deteriorating the effectiveness of the validation procedure. It utilizes binary classifiers to accomplish image validation and selects the optimal sample of pixels according to a specific objective function. As a result, the computational complexity of the validation experiment is substantially reduced. The procedure's effectiveness is illustrated using images composed of approximately 13 million pixels from the field of seed recognition. PARSEG provides roughly the same precision as the validation process extended to the entire image, but it utilizes only about 4% of the original number of pixels, thus reducing, by about 90%, the computing time required to validate a binary segmented image.
Keywords: Statistical image validation, Image segmentation, Background subtraction, Big data, Classification, CART, STAPLE, Bootstrap
Subject terms: Computational science, Statistics
Introduction
Images of biological objects, botanic seeds in our case, contain enormous amounts of information, which can be extracted and used as input for subsequent analyses. To extract a piece of information from an image, it is necessary to preprocess it using the tools of image analysis1. The preprocessing consists of several phases, among which image segmentation is one of the most important, as it involves splitting an image into parts that are strongly associated with real objects of interest2,3. Image segmentation remains a never-ending challenge: every segmentation method suffers from drawbacks, typically connected to limited accuracy, excessive complexity, and exaggerated time and space requirements. One possible solution, in many situations considered the gold standard, is the use of human raters. However, it is usually quite costly, because it is not easy to train raters and, worse, to maintain their uniformly high performance over a long time horizon. Moreover, both intra- and inter-rater variability arise due to fatigue, especially when dealing with extensive data sets.
A specific case is binary image segmentation, where the image is divided into two parts, called foreground and background, which correspond to the parts we are and are not interested in. Despite the great progress in this field, binary segmentation is still one of the most challenging tasks in image processing, image understanding, artificial intelligence4 and big data5–7.
Because segmentation algorithms may lack accuracy and precision, and because ground truth is frequently missing, assessing their performance is a difficult task. This assessment is of key importance, especially when the segmentation output is later analyzed statistically, because the results of statistical analyses are, to a considerable degree, influenced by the quality of the input data. The method, or set of methods, to be used to compare segmentation approaches has not yet been clearly defined; several methods are used in practice. The most common method to assess the quality of image segmentation is the interactive drawing of the image by experts. However, it cannot be considered reliable because, besides intra- and inter-expert variability, it is labour-intensive, subjective, and often suffers from inconsistencies and errors. Alternatively, computer-aided automatic methods can serve this purpose: although they should remove the variability of assessments, they are not always able to provide reliable results. The common problem in characterizing both human experts and automatic methods is that the true segmentation of the image is unknown, particularly in the case of medical images, in which the true segmented image might vary from case to case, since the same pathology can appear in different forms or shapes.
One feasible alternative to human recognition is statistical validation of the performance of image segmentation algorithms. In statistics, validation is the task of confirming that the outputs of a statistical model are acceptable with respect to the real data-generating process. In image analysis, statistical validation is a process aimed at confirming that the output of an image segmentation method is accurate. If statistical validation provides reliable results, it is very likely that the considered image segmentation method is, with maximum reliability, able to reproduce the main features of the analyzed image. To account for the above-mentioned drawbacks derived from human recognition, an automatic and effective procedure has been proposed in8. It aims at the statistical validation of the outcomes provided by the binary segmentation of images based on statistical classification algorithms. Such a validation procedure is typically performed on very large data sets, inasmuch as the number of pixels in an image easily exceeds millions. The computational complexity of the validation experiment of segmented images is thus very high. To reduce this complexity, we present here a method called PARSEG, which comprises the following data-processing steps: PArtitioning, Random Selection, Estimation, and Generalization. PARSEG enables us to perform statistical validation of binary images by selecting the minimum number of pixels from the original image to be used for validation without deteriorating the effectiveness of the validation procedure. PARSEG overcomes the computational complexity of statistical image validation. The basic motivation supporting the use of PARSEG is derived from our empirical experiments: the results of statistical validation of binary segmentation methods, obtained by training a classifier on all pixels of the analyzed image, are consistent with those obtained using much smaller randomly selected samples of pixels of a specific size. 
This equivalence leads to a considerable decrease in the computational complexity of validation for binary segmentation of images comprising millions of pixels when using PARSEG.
The selection of the optimal sample of pixels is derived from a properly selected objective function, which must be minimized to reduce the computational complexity of the validation procedure (see Section “Objective function” for details). Operationally, PARSEG is based on a sampling scheme that allows us to select a reduced number of pixels and, at the same time, preserves a sufficient scope of information for the subsequent image validation (see Section “Data partitioning and random subset selection” for details). First, the entire image is partitioned into subsets of pixels of approximately equal size. Second, the minimum sample size of pixels to be extracted at random from a single subset is identified. This optimal reduced size should, as much as possible, preserve the same amount of information as the original (complete) data used in the image validation process. The optimal size is selected by studying the (functional) relationship between the possible sample sizes and the predictive performance of an appropriate classifier, selected by the user (see Section “Consistency measure” for details). Next, during the generalization step, validation based on statistical classifiers is performed independently on the remaining subsets, using solely a sample of pixels with the previously identified optimal size. Finally, the results obtained from all subsets are combined to assess the validation's effect on the entire image (see Section “Selection of the optimal sample size” for details).
The effectiveness of PARSEG is demonstrated through examples from plant biology, namely, the classification of seeds from a germplasm bank. Recall that, in the two most recent decades, many specialists in the field of botanical taxonomy have testified to the growing importance of biometric features obtained by computer vision techniques in the characterization and identification of plant species9–11, varieties12,13, and ancient plants14,15. Within this framework, the main initial point of interest is to correctly separate the pixels into foreground and background. Since there is no single method that can be recommended as preferable for all types of images, it is necessary to compare different binary segmentation procedures, enabling one to select “the most suitable one”16. This uncertainty is considered in our experiments, as the different segmentation methods are ranked with respect to their performance from the most to the least accurate (see Section “Giallo Bosa example”).
The paper is organized as follows. Binary thresholding and its statistical validation are concisely discussed in Section “Binary thresholding and assessing its quality via statistical validation”. PARSEG, its main features, objective function, and key procedures are explained in Section “PARSEG”. Section “Comparison between PARSEG and STAPLE” illustrates a comparison between PARSEG and the Simultaneous Truth and Performance Level Estimation algorithm (STAPLE), a similar approach presented in the literature. Section “Validating binary segmented seed images” illustrates the results of our approach applied to the analysis of real data (binary-segmented seed images), together with a discussion of the corresponding pros and cons. Finally, Section “Concluding remarks” provides the main conclusions of the paper and outlines plans for future work.
Binary thresholding and assessing its quality via statistical validation
In mathematics, an image can be modeled by a continuous function of two variables f(x, y), where (x, y) are the coordinates in the plane (usually pixel indices). If the image is in grayscale, then f is a scalar function; it has three or four dimensions if the image is in a color mode. Depending on the combination of primary colors used, it is possible to choose between different color spaces, among which the most common are RGB and CMYK. In this paper, we deal with RGB images. Consequently, f(x, y) = (r(x, y), g(x, y), b(x, y)), where r(x, y), g(x, y), and b(x, y) represent the intensities of the red, green, and blue color channels of a given pixel (x, y), respectively.
The statistical validation method we propose here can be applied to any image segmentation method. However, for simplicity of our exposition, we focus on one of the most commonly used: grey level thresholding (see17 for an adaptive approach).
Recall that thresholding can be interpreted as a transformation of an image f into a binary image o, where

o(x, y) = 1 if f(x, y) ≥ T(x, y), and o(x, y) = 0 otherwise.   (1)

T(x, y) is the threshold value for pixel (x, y), o(x, y) = 1 stands for a foreground pixel, and o(x, y) = 0 for a background one1. The main critical task of this method is the selection of a correct threshold, which is essential for successful segmentation and the subsequent analysis. To this purpose, it is possible to use global or local information and, consequently, to choose between global and local thresholding. Global thresholding consists of finding a single threshold T for the entire image, i.e., T(x, y) = T for every pixel (x, y); local thresholding, instead, utilizes a threshold value T(x, y) for each pixel separately, based on information about its neighbors.
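As a concrete illustration of Eq. (1), both variants can be sketched in a few lines of Python; the function names and the window-mean rule used for the local threshold are our own choices, not taken from the paper:

```python
import numpy as np

def global_threshold(gray, T):
    """Eq. (1) with a single threshold T for the whole image:
    a pixel becomes 1 (foreground) if its intensity exceeds T, else 0."""
    return (gray > T).astype(np.uint8)

def local_mean_threshold(gray, win=3):
    """Local variant: T(x, y) is the mean intensity of the win x win
    neighbourhood of pixel (x, y), a simple adaptive scheme."""
    pad = win // 2
    padded = np.pad(gray.astype(float), pad, mode="edge")
    out = np.zeros(gray.shape, dtype=np.uint8)
    for x in range(gray.shape[0]):
        for y in range(gray.shape[1]):
            T = padded[x:x + win, y:y + win].mean()
            out[x, y] = 1 if gray[x, y] > T else 0
    return out
```

The global variant needs a single pass over the image, whereas the local one recomputes T(x, y) per pixel, which is the usual trade-off between speed and adaptivity mentioned above.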
Our approach to the validation of the output produced by any binary image segmentation method is based on statistical modeling; hence, the term statistical validation is used18. Some approaches to validation (like19) aim at defining membership functions based on image descriptors, as an alternative to the classical histogram-based image descriptors. In our case, statistical validation is carried out through a classification experiment whose results are evaluated by a coherence index, enabling us to check the quality of the binary segmentation outcome8.
The main features of a statistical validation experiment in the case of grey-level thresholding segmentation (these features characterize any image segmentation method) are:
The labels assigned by a specific binary image segmentation method, either foreground or background, are used as binary response variables for a statistical classifier. This means that pixels are re-classified into one of the two categories on the basis of the corresponding RGB intensities to derive the “validated labels”.
As for the assessment of the classifier's performance, it is possible to use a metric that compares the pixel-wise observed labels with the predicted ones. This metric might be, for instance, accuracy, sensitivity, specificity, positive predictive value, or the Area Under the ROC Curve (see20,21 for a discussion).
The selected metric is then used to evaluate the quality of the validation experiment by ranking the alternative image segmentation algorithms. The higher the accuracy level of the classifier, i.e., the higher the correspondence between the labels obtained from the image segmentation algorithm and the labels predicted by the classifier, the higher the image segmentation algorithm is ranked. If this is the case, the validation experiment produces satisfactory results and the image segmentation method is considered reliable for the assignment of the “validated” label (background or foreground) to each pixel.
PARSEG
We provide a step-by-step description of PARSEG illustrating every single step and the main issues characterizing the resulting validation experiment.
Objective function
We denote by D_s a sample of pixels of size s randomly drawn from the entire image, and by S = {s_1, s_2, …, s_K} a pre-specified set of sample sizes (s_i < s_j if i < j), with s_K = N indicating the total number of pixels in a given image. Let δ_s be the index measuring the difference in terms of consistency (i.e., numerical coherence, to be explained in detail in Section “Consistency measure”) between the validation results obtained on D_s and on D_N; δ_s decreases when s increases, and

δ_s = h(s)

is the function describing the relationship between s and δ_s; an empirical study based on our data showed that h tends to be monotonically decreasing, since δ_s monotonically decreases on average when s increases.
The search for the “optimal” minimum sample size, say s*, aims at balancing the relative increase in complexity observed when moving from s_i to s_{i+1} against the relative decrease in the difference δ_s. Thus, s* is defined as

s* = {s ∈ S : h′(s) = −1},   (2)

where h′ denotes the derivative of h, computed on standardized axes (see Fig. 2). In practice, given the set of sample sizes S, the optimal point s* corresponds to the point for which h′(s) is closest to −1.
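Under the tangent-slope reading of Eq. (2) illustrated in Fig. 2, the selection can be sketched as follows; the finite-difference estimate of h′ is our simplification (the paper fits a cubic spline), and `optimal_sample_size` is a hypothetical helper name:

```python
import numpy as np

def optimal_sample_size(sizes, deltas):
    """On axes standardized to [0, 1], return the sample size where the
    slope of the size/consistency curve is closest to -1, i.e. where a
    further increase in s no longer repays the extra computational cost."""
    s = np.asarray(sizes, dtype=float)
    d = np.asarray(deltas, dtype=float)
    s_std = (s - s.min()) / (s.max() - s.min())
    d_std = (d - d.min()) / (d.max() - d.min())
    slopes = np.gradient(d_std, s_std)  # finite-difference estimate of h'
    return sizes[int(np.argmin(np.abs(slopes + 1.0)))]
```

On a convex, monotonically decreasing curve this picks the "elbow": before it the curve falls faster than the size grows, after it the gain no longer justifies the cost.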
Data partitioning and random subset selection
To combine the original RGB image f with the corresponding binary image o:

the N pixels of f are organized into a set X = {x_1, …, x_N}: each x_i contains the three values representing the RGB color channel intensities of pixel i;

the corresponding pixels of o are arranged in Y = {y_1, …, y_N};

X and Y are joined to create a new set D = {(x_i, y_i), i = 1, …, N}.

D is a collection of N pairs containing the information about both the original pixels of f (the input) and o (the output). Next, D is partitioned into M mutually disjoint subsets D_1, …, D_M using random sampling stratified by Y. Consequently, the M subsets (of cardinality n ≈ N/M) are characterized by a similar distribution of the categories of Y and an unknown function that maps X to Y.
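The partitioning step can be sketched as follows; `stratified_partition` is a hypothetical helper that returns M disjoint index sets, each with (approximately) the same foreground proportion as Y:

```python
import numpy as np

def stratified_partition(y, M, rng=None):
    """Split the pixel indices 0..N-1 into M disjoint subsets, preserving
    the foreground/background proportions of the binary labels y
    (random sampling stratified by y). Returns a list of M index arrays."""
    rng = np.random.default_rng(rng)
    parts = [[] for _ in range(M)]
    for label in np.unique(y):
        # shuffle the indices of this class, then deal them out evenly
        idx = rng.permutation(np.flatnonzero(y == label))
        for m, chunk in enumerate(np.array_split(idx, M)):
            parts[m].extend(chunk.tolist())
    return [np.array(sorted(p)) for p in parts]
```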
Validation
To validate a binary segmentation method, one subset D_m is randomly selected and validated first. To reduce computational complexity, a subsample D_s of size s ≤ n is drawn from D_m, and the pixels in D_s are randomly partitioned into a learning set L of cardinality n_L and a validation set V of cardinality n_V, such that n_L + n_V = s and n_L = γ · n_V, where γ is the ratio between the two cardinalities.
Next, the pixels of the validation set are validated by computing the predicted outcome ŷ using an appropriate classifier g. The function g uses the observations of the learning set L for training and estimates ŷ for the observations in the validation set V. In our experiments, although any alternative metric could be considered, sensitivity (also called the true positive rate, recall, or probability of detection) is used as the reference classifier performance metric, since it has been empirically confirmed as a reliable metric in statistical validation experiments. It is defined as
SE = TP / (TP + FN),   (3)

where TP and FN denote the numbers of true positive pixels (foreground pixels predicted as foreground) and false negative pixels, respectively.
SE is computed for each possible sample size s of the randomly selected subset D_m. Moreover, to take into account model instability, the influence of outliers, and possible variable selection bias, the function g in PARSEG is estimated B times for each size s, each time with a different random partition of D_s into L and V. Accordingly, for a sample D_s drawn from the partition D_m, the performance of g is evaluated in terms of the average sensitivity

SE_s = (1/B) Σ_{b=1}^{B} SE_{s,b}.   (4)
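Eqs. (3)-(4) can be sketched as below. The paper uses CART as the classifier g; to keep the sketch dependency-free we substitute a nearest-centroid rule, which is our assumption, not the paper's choice:

```python
import numpy as np

def sensitivity(y_true, y_pred):
    """Eq. (3): TP / (TP + FN), with 1 = foreground as the positive class."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return tp / (tp + fn)

def average_sensitivity(X, y, B=10, gamma=4, rng=None):
    """Eq. (4): mean sensitivity over B random learning/validation splits
    with |learning| = gamma * |validation|. A nearest-centroid rule stands
    in for the CART classifier used in the paper."""
    rng = np.random.default_rng(rng)
    n_learn = int(len(y) * gamma / (gamma + 1))
    values = []
    for _ in range(B):
        idx = rng.permutation(len(y))
        learn, valid = idx[:n_learn], idx[n_learn:]
        c0 = X[learn][y[learn] == 0].mean(axis=0)  # background centroid
        c1 = X[learn][y[learn] == 1].mean(axis=0)  # foreground centroid
        d0 = np.linalg.norm(X[valid] - c0, axis=1)
        d1 = np.linalg.norm(X[valid] - c1, axis=1)
        values.append(sensitivity(y[valid], (d1 < d0).astype(int)))
    return float(np.mean(values))
```

Averaging over B random partitions is what damps the model instability mentioned above: a single unlucky split no longer dominates the estimate.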
Consistency measure
The basic idea supporting PARSEG is the selection of the “optimal” size s* as the smallest size that ensures the validation on D_s to be consistent with that on D_n, where n is the total number of elements of D_m. To measure the difference in terms of consistency between D_s and D_n, we consider the index

δ_s = (SE_n − SE_s) · (σ_s / σ_n),   (5)

where

σ_s = [ (1/B) Σ_{b=1}^{B} (SE_{s,b} − SE_s)² ]^{1/2},  σ_n = [ (1/B) Σ_{b=1}^{B} (SE_{n,b} − SE_n)² ]^{1/2}   (6)

represent, respectively, the standard deviations of the values SE_{s,b} and SE_{n,b}, b = 1, …, B. Eq. (5) is made up of two terms: the first, SE_n − SE_s, evaluates how much the sensitivity obtained for D_s differs from that obtained for D_n, which is the highest one. The second term, σ_s / σ_n, weighs the first term with respect to the higher estimation uncertainty derived from the use of a sub-sample D_s in place of the entire set of observations D_n. For any classifier g, an increase in the sample size s is likely to make the classifier more accurate, which in turn decreases the value of δ_s.
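A sketch of Eqs. (5)-(6) under our reading of the surrounding text (the sensitivity gap weighted by the ratio of the standard deviations of the B replications); treat the exact functional form as an assumption:

```python
import numpy as np

def consistency_index(se_s, se_n):
    """delta_s as described around Eqs. (5)-(6): the gap between the mean
    sensitivity on D_n and on D_s, weighted by sigma_s / sigma_n, i.e. the
    extra estimation uncertainty of the sub-sample. The inputs are the B
    sensitivity values obtained on D_s and on D_n, respectively."""
    se_s = np.asarray(se_s, dtype=float)
    se_n = np.asarray(se_n, dtype=float)
    gap = se_n.mean() - se_s.mean()
    return gap * (se_s.std() / se_n.std())
```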
Selection of the optimal sample size
The search for s* through the objective function (Eq. 2) should be carried out after estimating δ_s for each reduced sample D_s, s ∈ S. To further reduce computational complexity, we consider the efficient approach summarized in Algorithm 1. It requires two user-defined input parameters, l and q. The first is the minimum number of sample sizes in which to search for the optimal one in the first iteration. In iteration i, the optimal sample size is searched for in a subset S_i of possible sample sizes composed of the first elements of S (their number growing with i) plus the maximum size (n); the algorithm stops when the same (optimal) sample size is found for q consecutive iterations.

Next, the index δ_s is computed for each sample size belonging to S_i, and the function h describing the relationship between the standardized values of the sample sizes and the standardized values of the index δ_s is fitted. The optimal sample size is found by applying the objective function (Eq. 2). If the last optimal sample size equals the optimal sample sizes found in the previous q − 1 iterations, the algorithm stops; otherwise, it keeps running.
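A minimal sketch of Algorithm 1 under stated assumptions: at iteration i the candidate set contains the first l + i sizes of S plus the maximum size n; the Eq. (2) criterion is restated inline as the size whose finite-difference slope on standardized axes is closest to −1 (the paper fits a cubic spline instead); the loop stops after q identical optima in a row:

```python
import numpy as np

def _find_opt(sizes, deltas):
    # Eq. (2) criterion: size whose slope on standardized axes is nearest -1.
    s = (sizes - sizes.min()) / (sizes.max() - sizes.min())
    d = (deltas - deltas.min()) / (np.ptp(deltas) if np.ptp(deltas) else 1.0)
    return int(sizes[int(np.argmin(np.abs(np.gradient(d, s) + 1.0)))])

def algorithm1(all_sizes, delta_fn, l=5, q=3):
    """Iteratively enlarge the candidate set of sample sizes and stop once
    the same optimal size is returned q times in a row. delta_fn(s)
    evaluates the consistency index for a sample of size s."""
    n = all_sizes[-1]
    history = []
    for i in range(len(all_sizes)):
        cand = np.unique(np.append(all_sizes[:l + i], n))
        deltas = np.array([delta_fn(s) for s in cand], dtype=float)
        history.append(_find_opt(cand, deltas))
        if len(history) >= q and len(set(history[-q:])) == 1:
            break
    return history[-1]
```

The early stop is what saves computation: δ_s never has to be evaluated for the larger sizes once the optimum has stabilized.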
Algorithm 1.
Selection of the optimal sample size
Once s* is defined for a given subset D_m, it can be used as the reference sample size for the other subsets because, due to the stratified sampling scheme described in Section “Data partitioning and random subset selection”, the response classes and the RGB intensities have the same distribution as in the entire image. In particular, the same distribution of the response classes in the M subsets is guaranteed because the subsets are created by randomly partitioning all pixels under the constraint of having the same proportion of foreground pixels (and consequently also of background ones) as in the entire image. The same distribution of RGB intensities in the M subsets, instead, follows from the randomness that regulates the assignment of pixels to each subset, inasmuch as we assume that the pattern describing the relationship between the response classes and the RGB intensities is identical everywhere in the image. Consequently, samples of size s* are drawn from the subsets and the metric (Eq. 3) is computed in each subset D_m. Next, the result is extended to the entire image by averaging its values over the M subsets
SE_{s*} = (1/M) Σ_{m=1}^{M} SE_{s*,m}.   (7)
In the next Section, we apply PARSEG to the images of botanic seeds. PARSEG provides roughly the same precision as the validation process extended to the entire image composed of N pixels but, importantly, it reduces the computational complexity from O(N) to O(M · s*), with M · s* ≪ N.
It is important to note that the segmentation method to be evaluated has to be carried out at the beginning of the process only. At each step, PARSEG uses solely pixels from the set , which contains the pixel intensities and their corresponding binary outputs defined by the underlying segmentation method.
Comparison between PARSEG and STAPLE
Although PARSEG concentrates more on the computational side of the statistical validation of images, with the aim of selecting the best segmentation among those considered, its final goal is to provide a segmentation to be used as the best one. Consequently, in this Section we compare the output obtained by PARSEG with that obtained by another method accepted in the literature22–25. As evident from the citation reports in both Web of Science and Google Scholar, the STAPLE algorithm22 is a widely accepted method for the statistical validation of image segmentation, due to its sound theory and ease of use. STAPLE quantifies the performance of image segmentation raters (human or algorithmic) without knowing the true foreground, and is considered particularly useful in cases in which it is difficult to obtain or estimate a known true segmentation. It considers a set of segmentation outputs of an image and estimates, for each of them, the probability of being the true segmentation. The latter is used to create an optimal combination of the segmentation options by weighing them according to their estimated performance level and by incorporating a prior model that considers the spatial distribution of the segmented structures and spatial homogeneity constraints.
Both STAPLE and PARSEG pursue the goal of finding the best segmentation without knowing the true one: the former by generating a new segmentation from the optimal combination of the original ones, the latter by finding the best segmentation among those available. Furthermore, both methods define a relative performance measure of the original segmentation options according to their proximity to the best one. However, they operate differently: STAPLE identifies the best segmentation by comparing the original segmentation options and the prior information available (if any); PARSEG searches for the patterns that link the original images (i.e., the color channel intensities) to the segmentation options, without basing its analysis on any comparison. Consequently, STAPLE's performance could suffer if the segmentation set contains many wrong segmentation outputs and few correct ones. Instead, since PARSEG is not based on a comparison among the segmentation outputs, its performance is not influenced by the presence of a wrong segmentation. However, if all the initial segmentations are wrong, neither PARSEG nor STAPLE can improve them: as is well known not only in statistics but also in computing and other fields, incorrect or poor-quality input produces faulty output (garbage in, garbage out).
Concerning the computational requirements, we have assessed that both methods are linear in the number of segmentation outputs to be evaluated. Moreover, PARSEG is linear in the optimal size times the number of partitions, that is, O(M · s*), whereas STAPLE is linear in the number of pixels, O(N). Since M · s* ≪ N, PARSEG allows for important computational savings.
Validating binary segmented seed images
We present detailed results obtained by applying PARSEG to the images of the seeds of the species Giallo Bosa, and we summarize more concisely the results obtained for a set of sixteen images of different seed species, including Giallo Bosa. We used data collected in previous studies10,26. The seeds were gathered by the authors of those studies10,26 from 16 traditional Sardinian cultivars from the CNR-ISPA field catalogue (Nuraxinieddu, Sardinia, Italy) (Table 1) and stored at the Banca del Germoplasma Sardo (BG-SAR) of the University of Cagliari. The mature fruits were collected randomly in order to obtain representative samples while reducing the impact of intra-specific variation in seed shape and size caused by the position of the fruit on the plant and of the seed within the fruit.
Table 1.
General information about seed gathering.
| | Species | Sampling location | Number |
|---|---|---|---|
| 1 | Cariadoggia | Alghero | 80 |
| 2 | Cariasina | Medio Campidano | 39 |
| 3 | Coru | Laconi | 55 |
| 4 | Coru e Columbu | Laconi | 80 |
| 5 | Croccorighedda | Laconi | 30 |
| 6 | Fara | Bonarcado | 30 |
| 7 | Giallo Bosa | Bosa | 30 |
| 8 | Laconi A | Laconi | 87 |
| 9 | Melone | Gonnosfanadiga | 77 |
| 10 | Mirabolano Giallo | * | 90 |
| 11 | Mirabolano Rosso | * | 75 |
| 12 | Nero Sardo | Bosa | 99 |
| 13 | San Giovanni | Oristano | 39 |
| 14 | Sanguigna I Bosa | Bosa | 85 |
| 15 | Shiro | * | 94 |
| 16 | Sighera | Gonnosfanadiga | 88 |
*Stands for commercial species.
These data were collected with the goal of developing a methodology that allows us to discriminate between seeds as accurately as possible. This is an important task from a quality control standpoint: one of the most important ways to enhance food quality is to guarantee the origin of different food products through traceability, which makes it possible to identify responsibilities, optimize the supply chain, and ensure consumer food safety. Simply relying on documentation does not guarantee the truthfulness of a product's origin. Thus, it is essential to develop instruments that provide a higher degree of reliability. Since seeds are among the most important raw materials in the agri-food market, discrimination among them is crucial to understanding their origins.
Giallo Bosa example
The RGB images of the Giallo Bosa seeds are captured twice, once on a black background and once on a white background, in both cases without changing the position of the seeds, at a resolution of approximately 13 million pixels. Next, the background subtraction approach is applied, resulting in a new image that serves as input for the binary segmentation algorithms. Recall that background subtraction is a method widely used for detecting moving objects in a video, which has been adapted and modified for image segmentation in8. It combines local and global thresholding techniques to take advantage of the computational efficiency of the former and the accuracy of the latter, provides good segmentation results, and allows for automating the process when the foreground color of images is not constant. Moreover, it speeds up computations quite significantly. All the algorithms listed in Table 2 are applied to separate the foreground, i.e., the seeds, from the background. Since all these algorithms require one-dimensional input, the input image provided by the background subtraction approach is first converted from RGB to grey scale (see Fig. 1). Finally, the morphological operators erosion and dilation (described in27) are used to enhance the quality of the binary segmentation output.
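The intuition behind the background-subtraction step can be sketched as below; this reproduces only the core idea (a per-pixel difference between the two captures), not the full local-and-global thresholding combination of ref. 8:

```python
import numpy as np

def background_difference(img_black, img_white):
    """The seeds occupy the same pixels in both captures, so the per-pixel
    difference is large where the background shows (black vs. white) and
    small on the unchanged foreground. The returned grey-level map can be
    fed to the one-dimensional thresholding algorithms of Table 2."""
    diff = np.abs(img_black.astype(int) - img_white.astype(int))
    return diff.mean(axis=2).astype(np.uint8)  # collapse RGB to grey level
```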
Table 2.
The most widespread and frequently used binary segmentation algorithms.
| Segmentation algorithm | References | Label |
|---|---|---|
| Adaptive document image binarization | 28 | Sauvola |
| Alternative implementation of Huang’s method | 29 | Huang2 |
| Huang’s fuzzy thresholding method | 30 | Huang |
| Intermodes | 31 | Intermodes |
| Mean of gray levels | 32 | Mean |
| Means of image thresholding | 33 | Shanbhag |
| Minimum | 31 | Minimum |
| Otsu’s threshold | 34 | Otsu |
| Renyi’s entropy threshold | 35 | RenyiEntropy |
| Similarity-invariant pattern recognition | 36 | Percentile |
| Triangle method | 37 | Triangle |
| Tsai’s method | 38 | Moments |
Figure 1.
Image of the Giallo Bosa seeds captured using: (a) a black background; (b) a white background; (c) the image resulting from the background subtraction described in8.
To validate the output of the different binary segmentation algorithms with PARSEG, the input parameters are set as follows:
The number of subsets M into which the complete set of pixels is partitioned is set to 40. Concerning M, it is evident that the final sub-images (needed for the analysis) cannot be too small, otherwise they do not contain enough information. On the other hand, they should not be unnecessarily large, otherwise the procedure becomes computationally too costly. Our numerical experiments show that a sub-image size of 0.3-0.4 MP is suitable for our goals, leading to M = 40. Evidently, changing the value of M can influence the results, so it should be set (tuned) carefully. On the other hand, once reasonably set for a class of specific images, it is not necessary to change it from one image to another.
The number of possible sample sizes is set to 28. Thus, the different sizes range from 100 pixels up to the maximum size n, i.e., the cardinality of a subset.
For each sample size s ∈ S, the function g is estimated B times.
The ratio γ between the cardinalities of the learning set and the validation set is set to 4.
Classification And Regression Trees (CART39) are used as the reference classifier in the validation experiment. Note that, in principle, any binary classifier might be used within PARSEG. We use CART because it is flexible, capable of dealing with collinearity effects, detecting complex interaction effects, and processing high-dimensional data sets. At the same time, it rarely induces overfitting problems and is well known for its good predictive capabilities.
The output of the procedure described in Section “PARSEG”, aimed at determining the optimal sample size for the image validation experiment, is shown in Fig. 2. For each segmentation algorithm, the optimal size s* is selected according to Eq. (2), and the quality of the validation experiment is measured by computing the average sensitivity metric introduced in Eq. (7). Table 3 provides evidence of the reduction in execution times induced by the proposed method. The total number of pixels used in the analysis (sampling size) ranges from 2.67% to 3.16% of the total number of pixels composing the entire image, the value depending on the segmentation algorithm. The proposed approach allows us to save from 85% to 93% of the time required to perform statistical validation on the entire segmented image. The time saved is indicated by Δt and computed as follows

Δt = 100 · (t_N − t_PARSEG) / t_N,   (8)

where t_N is the time required to validate the results of the binary segmentation carried out on the entire image and t_PARSEG is the time required to validate the results of the binary segmentation through PARSEG. The difference in the computational time among segmentation algorithms in our case is due solely to the time needed to estimate the optimal sample size. In particular, this time depends on how close the segmentation output is to the pattern expressed by the color channels. More precisely, if the segmentation output differs substantially from the pattern expressed by the color channels (i.e., the original image), PARSEG needs more time to reach its stopping criterion in the optimal sample size estimation.
Figure 2.
For each segmentation algorithm, the projection of the points of S_i identified by the standardized sample sizes (x-axis), where S_i is the subset of sample sizes needed to find the optimal sample size s*, and the standardized consistency measures δ_s (y-axis). The dashed line represents the cubic spline that estimates their relationship. The solid line identifies the tangent of the cubic spline, i.e., the point where its derivative equals −1, while the red point has coordinates (s*, δ_{s*}): it corresponds to the point closest to the tangent line.
Table 3.
Sizes used to perform the proposed approach for each segmentation algorithm and the corresponding computational time obtained for the Giallo Bosa image.
| Segmentation algorithm | Sampling size | Sampling size as % of entire image | Computational time: Sample (opt. size + samples) | Computational time: Whole | ΔT |
|---|---|---|---|---|---|
| Minimum | 339 737 | 2.67% | 32 (9 + 23) | 217 | 85% |
| Intermodes | 339 737 | 2.67% | 32 (9 + 23) | 218 | 85% |
| Otsu | 339 737 | 2.67% | 32 (9 + 23) | 218 | 85% |
| Huang2 | 339 737 | 2.67% | 32 (9 + 23) | 217 | 85% |
| Moments | 339 737 | 2.67% | 33 (10 + 23) | 216 | 85% |
| Sauvola | 339 737 | 2.67% | 33 (10 + 23) | 237 | 86% |
| RenyiEntropy | 339 737 | 2.67% | 38 (15 + 23) | 493 | 92% |
| Shanbhag | 339 737 | 2.67% | 36 (13 + 23) | 373 | 90% |
| Triangle | 347 537 | 2.73% | 38 (15 + 23) | 489 | 92% |
| Mean | 359 337 | 2.82% | 39 (16 + 23) | 486 | 92% |
| Huang | 402 537 | 3.16% | 41 (18 + 23) | 462 | 91% |
| Percentile | 359 337 | 2.82% | 41 (18 + 23) | 578 | 93% |
The second and third columns report the numbers of pixels used and the percentages of pixels of the complete image, respectively. The last three columns show the times (in minutes) needed to carry out the analyses using the proposed approach (sample), on the entire image (whole), and the savings achieved by the proposed approach (ΔT). For the proposed approach, the brackets decompose the time into its two components: the time needed to select the optimal sample size (opt. size) and the time needed to carry out the analysis on the remaining samples.
To demonstrate the effectiveness of PARSEG, its performance is compared to that obtained without applying it. To carry out this comparison, the segmentation outcomes of all twelve binary segmentation algorithms are validated using the total number of pixels N. The main results are summarized in Table 4. For both approaches to validation, the global average sensitivity of the segmentation outputs stemming from the use of different algorithms is sorted in decreasing order. Note that the average sensitivity substantially preserves the same ranking of the segmentation outputs whether validation is performed on the entire image or on the optimal sample. Next, the similarity between the two rankings is measured with the rank correlation coefficient τX40, an extended version of Kendall’s τ41, where ‘X’ stands for extended. The coefficient takes on values in [−1, 1]: τX = 1 if the two rankings are identical; τX = −1 if they are perfectly opposed. If no correlation exists between the two rankings, then τX = 0. In our case, τX = 0.939 confirms the high similarity between the two rankings. The performance of the two approaches is further described in relative terms (the columns Normalized in Table 4) to simplify their comparison. It is evident that the two approaches can be considered equivalent with respect to the overall quality of the validation experiment. The use of a Spearman correlation coefficient gives very similar results.
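With no ties, the Emond–Mason coefficient coincides with plain Kendall's τ, so the similarity between the two rankings in Table 4 can be checked in a few lines (a sketch; the paper uses the full extended coefficient of ref. 40):

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall's tau between two tie-free rankings (lists giving the rank
    of each item). Without ties it coincides with the Emond-Mason tau_X."""
    n = len(rank_a)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        s = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Ranks of the twelve algorithms from Table 4 (whole vs sample), in table order
whole  = [1, 3, 2, 4, 5, 6, 7, 8, 10, 9, 11, 12]
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
print(round(kendall_tau(whole, sample), 3))   # → 0.939
```

Only two of the 66 pairs are discordant (Intermodes/Otsu and Triangle/Mean swap places), giving (64 − 2)/66 ≈ 0.939, the value reported for Giallo Bosa in Table 5.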
Table 4.
Giallo Bosa image: Comparison of the validation of all twelve segmentation outcomes performed on the optimal sample selected by the proposed approach (sample) or on the entire image (whole).
| Segmentation algorithm | Average sensitivity (rank): Whole | Average sensitivity (rank): Sample | Normalized: Whole | Normalized: Sample |
|---|---|---|---|---|
| Minimum | 0.999 95 (1) | 0.999 93 (1) | 1.000 00 | 1.000 00 |
| Intermodes | 0.999 94 (3) | 0.999 90 (2) | 0.999 71 | 0.998 96 |
| Otsu | 0.999 94 (2) | 0.999 90 (3) | 0.999 72 | 0.998 87 |
| Huang2 | 0.999 57 (4) | 0.999 26 (4) | 0.982 35 | 0.978 60 |
| Moments | 0.999 53 (5) | 0.999 19 (5) | 0.980 82 | 0.976 53 |
| Sauvola | 0.998 19 (6) | 0.998 16 (6) | 0.918 44 | 0.943 62 |
| RenyiEntropy | 0.998 15 (7) | 0.996 24 (7) | 0.916 91 | 0.882 83 |
| Shanbhag | 0.997 69 (8) | 0.995 12 (8) | 0.895 40 | 0.847 18 |
| Triangle | 0.995 56 (10) | 0.993 46 (9) | 0.796 72 | 0.794 63 |
| Mean | 0.997 37 (9) | 0.991 77 (10) | 0.880 77 | 0.741 07 |
| Huang | 0.991 74 (11) | 0.984 06 (11) | 0.620 15 | 0.496 40 |
| Percentile | 0.978 34 (12) | 0.968 42 (12) | 0.000 00 | 0.000 00 |
The average sensitivities and their ranks (in parentheses) are reported together with their normalized values obtained by rescaling the average sensitivities to [0, 1].
For the sake of completeness, Fig. 3 shows the outputs obtained from the binary segmentation methods used. The green points correspond to the pixels that have been recognized as foreground by the specific segmentation algorithm. The images are ordered according to the quality (sensitivity) of the validation experiment. It is worth noticing that, consistent with the results reported in Table 4, the first four segmentation algorithms provide considerably better outputs than the remaining ones.
Figure 3.

Output of considered segmentation methods obtained for the Giallo Bosa image. Pixels plotted in green correspond to those recognized as foreground by the given segmentation algorithm.
Finally, the performances of PARSEG and STAPLE are compared in Fig. 4, which shows the best segmentation obtained by the segmentation algorithms for the former and the segmentation output estimated by the latter. Since the true segmentation is unknown, it is impossible to establish with certainty which method is the best, but the result obtained by PARSEG appears clearly better than that obtained by STAPLE.
Figure 4.
Best segmentation obtained by PARSEG and STAPLE methods for the Giallo Bosa image. Pixels plotted in green correspond to those recognized as foreground by the given segmentation algorithm.
We think PARSEG could fail to work properly in two cases. First, the statistical validation of image segmentation algorithms behind PARSEG relies on the capability of the statistical classifier to recognize the pattern of separation between background and foreground inside the original image; the choice of the statistical classifier is therefore crucial for obtaining satisfactory results. Second, the operation of PARSEG is regulated by the partition of the data into M subsets characterized by a similar distribution of the categories of the segmentation output and by the unknown function that maps the color channels to that output. If the number of pixels is high (as in most cases), we expect with a high level of confidence that stratified random sampling will enforce this condition; if the number of pixels were low, however, the degree of confidence could drop. It is important to note that the former issue is handled by the researcher, whilst the latter is not.
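The stratified-sampling condition mentioned above can be sketched as follows (an illustrative helper, not the authors' code): each of the M subsets is dealt the same proportion of background and foreground pixels.

```python
import random

def stratified_partition(labels, m, seed=0):
    """Split pixel indices into m subsets, each preserving (approximately)
    the overall proportion of background (0) and foreground (1) labels.
    Illustrative sketch; PARSEG's actual partitioning step may differ."""
    rng = random.Random(seed)
    subsets = [[] for _ in range(m)]
    for category in (0, 1):
        idx = [i for i, lab in enumerate(labels) if lab == category]
        rng.shuffle(idx)                 # randomize within the stratum
        for k, i in enumerate(idx):
            subsets[k % m].append(i)     # deal indices round-robin
    return subsets

# Example: 90% background, 10% foreground, split into 5 subsets
labels = [0] * 900 + [1] * 100
parts = stratified_partition(labels, 5)
# each subset keeps the 9:1 background/foreground ratio
print([sum(labels[i] for i in p) for p in parts])   # → [20, 20, 20, 20, 20]
```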
Results for different types of seeds
The same experiment presented in Section “Giallo Bosa example” is repeated for the other 15 images of different seed species. Table 5 reports the results obtained for all 16 images. The average sampling size across the twelve segmentation algorithms ranges from 314 737 to 474 332 pixels, so PARSEG reduces the computational burden to, on average, below 4% of the total number of pixels composing the entire image. Specifically, PARSEG allows us to save from 86% to 92% of the time required when validation is performed using all pixels. The appropriateness of PARSEG is further confirmed by the high values of the τX coefficient, which range from 0.818 to 0.970.
Table 5.
Results obtained for the images of all sixteen of the analyzed seed species.
| Seed species | N | Average sample size | Average % of whole image | Average comput. time: Sample | Average comput. time: Whole | ΔT | τX |
|---|---|---|---|---|---|---|---|
| Cariadoggia | 12 477 201 | 347 922 | 2.79% | 36.25 | 435.0 | 92% | 0.879 |
| Cariasina | 11 569 761 | 329 669 | 2.85% | 33.17 | 318.5 | 90% | 0.970 |
| Coru | 12 491 721 | 385 793 | 3.09% | 36.33 | 336.8 | 89% | 0.970 |
| Coru e Columbu | 12 090 111 | 418 852 | 3.46% | 43.42 | 344.8 | 87% | 0.909 |
| Croccorighedda | 12 077 091 | 347 902 | 2.88% | 34.67 | 377.4 | 91% | 0.939 |
| Fara | 12 821 281 | 377 349 | 2.94% | 37.50 | 362.0 | 90% | 0.939 |
| Giallo Bosa | 12 727 494 | 348 887 | 2.74% | 35.58 | 350.3 | 90% | 0.939 |
| Laconi A | 12 898 821 | 398 287 | 3.09% | 38.42 | 379.3 | 90% | 0.939 |
| Melone | 12 738 291 | 474 332 | 3.72% | 44.83 | 329.4 | 86% | 0.879 |
| Mirabolano Giallo | 11 206 801 | 314 737 | 2.81% | 32.17 | 331.5 | 90% | 0.879 |
| Mirabolano Rosso | 12 374 131 | 353 261 | 2.85% | 35.08 | 313.8 | 89% | 0.879 |
| Nero Sardo | 13 072 930 | 374 498 | 2.86% | 37.58 | 355.9 | 89% | 0.879 |
| San Giovanni | 10 943 511 | 350 362 | 3.20% | 31.17 | 243.0 | 87% | 0.818 |
| Sanguigna I Bosa | 12 233 641 | 360 474 | 2.95% | 35.33 | 351.5 | 90% | 0.970 |
| Shiro | 13 273 416 | 431 660 | 3.25% | 39.67 | 356.2 | 89% | 0.879 |
| Sighera | 12 333 561 | 369 697 | 3.00% | 36.08 | 370.1 | 90% | 0.909 |
The first column reports the seed species, the second column the numbers of pixels that compose the images, and the third and fourth columns indicate the average values of, respectively, the sampling size and the percentage of pixels used from the entire image, computed over the twelve segmentation algorithms. The fifth to seventh columns show the average times (in minutes) needed to carry out the analyses using the proposed approach (sample), the entire image (whole), and the percentage decrease in computing time obtained when using the proposed approach (ΔT). The last column reports the τX coefficients computed between the rankings obtained using the proposed approach and those using all pixels.
Figures 5, 6 and 7 compare the best segmentations obtained by PARSEG and STAPLE for the 15 additional images. PARSEG obtained a better segmentation 11 times out of 15 (73%), whilst no important differences are observed in the remaining four cases.
Figure 5.
Best segmentations obtained by PARSEG (on the left) and STAPLE (on the right) for the images: Cariadoggia, Cariasina, Coru, Coru e Columbu, Croccorighedda. Pixels plotted in green correspond to those recognized as foreground by the given segmentation algorithm.
Figure 6.
Best segmentations obtained by PARSEG (on the left) and STAPLE (on the right) for the images: Fara, Laconi A, Melone, Mirabolano Giallo, Mirabolano Rosso. Pixels plotted in green correspond to those recognized as foreground by the given segmentation algorithm.
Figure 7.
Best segmentations obtained by PARSEG (on the left) and STAPLE (on the right) for the images: Nero Sardo, San Giovanni, Sanguigna I Bosa, Shiro, Sighera. Pixels plotted in green correspond to those recognized as foreground by the given segmentation algorithm.
Concluding remarks
To reduce the computational complexity of the statistical validation of binary segmented images, we have introduced PARSEG, a novel statistical technique. The suggested approach preserves the performance of the validation experiment while considerably reducing its computational complexity. Its main feature is the use of a classifier, together with a related performance metric, to validate the output of binary segmentation algorithms. Although sensitivity has been used as a viable default choice of performance metric, different metrics can be used as well. Its main advantages are the ability to perform statistical validation on a reduced sample of pixels while providing the same results as validation carried out on all available pixels, the use of smoothing splines to select the reduced optimal sample, and the consistent reduction of computational complexity.
We applied PARSEG in a relatively simple framework (the segmentation of seed images). When validating images composed of about 13 million pixels in total, PARSEG used a sample size below 4% of the full image size (on average) to obtain validation results that were fully comparable to those obtained when all pixels were used for validation. As a result, the computing time required to perform image validation using all pixels was reduced by approximately 90%. The advantages of using PARSEG are greater when analyzing images of the same type.
Future work
In this paper we have concentrated on binary images. In the future, we plan to study two points in detail. The first is how the suggested approach behaves when segmentation algorithms partition the image into multiple parts rather than in a binary way, and which components of the procedure should be modified accordingly. The second is the influence of different metrics when PARSEG is applied to different types of images.
Acknowledgements
The work of L. Frigau and C. Conversano was supported by Next Generation EU Program and Piano Nazionale di Ripresa e Resilienza (PNRR), EU and Italian Ministry of University, Research Projects “e.INS - Ecosystem of Innovation for Next Generation Sardinia”, cod MUR:ECS00000038 and CUP:F53C22000430001, and “GRINS - Growing Resilient Inclusive and Sustainable”, cod MUR:PE0000018 and CUP:F53C22000760007. The work of L. Frigau was also supported by Fondazione di Sardegna.
Appendix: Alternative metrics: F1 score
In this appendix we provide the results obtained for Giallo Bosa when other metrics are used to measure classifier performance. In particular, we consider the F1 score. Table 6 summarizes the comparison of the validation of all twelve segmentation outcomes performed on the optimal sample selected by the proposed approach (sample) or on the entire image (whole), while Fig. 8 shows the output of the procedure described in Section “PARSEG” aimed at determining the optimal sample size for the image validation experiment.
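For reference, both sensitivity and the F1 score are simple functions of the confusion-matrix counts. A minimal sketch with hypothetical counts (not taken from the paper):

```python
def sensitivity(tp: int, fn: int) -> float:
    """Sensitivity (recall): fraction of true foreground pixels
    that the classifier recognizes as foreground."""
    return tp / (tp + fn)

def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 score: harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Illustrative counts (hypothetical): 980 true positives,
# 10 false positives, 20 false negatives
print(round(sensitivity(980, 20), 3))      # → 0.98
print(round(f1_score(980, 10, 20), 3))     # → 0.985
```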
Table 6.
Giallo Bosa image: Comparison of the validation of all twelve segmentation outcomes performed on the optimal sample selected by the proposed approach (sample) or on the entire image (whole). The average F1 scores and their ranks (in parentheses) are reported together with their normalized values obtained by rescaling the average F1 scores to [0, 1].
| Segmentation algorithm | Average F1 score (rank): Whole | Average F1 score (rank): Sample | Normalized: Whole | Normalized: Sample |
|---|---|---|---|---|
| Intermodes | 0.999 95 (2) | 0.999 88 (1) | 0.999 94 | 1.000 00 |
| Minimum | 0.999 96 (1) | 0.999 87 (2) | 1.000 00 | 0.999 63 |
| Otsu | 0.999 95 (3) | 0.999 87 (3) | 0.999 89 | 0.999 61 |
| Huang2 | 0.999 73 (4) | 0.999 52 (4) | 0.989 76 | 0.988 66 |
| Sauvola | 0.999 06 (6) | 0.998 87 (5) | 0.959 90 | 0.968 42 |
| Moments | 0.999 47 (5) | 0.998 39 (6) | 0.978 24 | 0.953 49 |
| RenyiEntropy | 0.997 45 (8) | 0.996 15 (7) | 0.887 40 | 0.884 01 |
| Shanbhag | 0.998 42 (7) | 0.995 95 (8) | 0.930 90 | 0.877 73 |
| Triangle | 0.996 57 (9) | 0.993 34 (9) | 0.847 85 | 0.796 58 |
| Mean | 0.996 36 (10) | 0.992 40 (10) | 0.838 18 | 0.767 36 |
| Huang | 0.988 31 (11) | 0.983 23 (11) | 0.476 12 | 0.482 41 |
| Percentile | 0.977 72 (12) | 0.967 71 (12) | 0.000 00 | 0.000 00 |
Figure 8.
For each segmentation algorithm, the projection of the points identified by the standardized sample sizes (x-axis), restricted to the subset of sample sizes explored to find the optimal sample size, and by the standardized consistency measures (y-axis). The dashed line represents the cubic spline that estimates their relationship. The solid line is the tangent of the cubic spline at the point where its derivative equals 1, while the red point is the observed point closest to this tangent line: it identifies the optimal sample size.
Author contribution
All authors contributed to the conceptualization of the paper. L.F. contributed to the methodology and conducted the analysis. All authors wrote the main manuscript text. J.A. and C.C. reviewed the manuscript.
Funding
The work of J. Antoch was partially supported by the Czech Science Foundation under the Grant number P403/22/19353S.
Data availability
Data and source code of the analysis in R are available from the authors on request by contacting the corresponding author at frigau@unica.it
Competing interests
The authors declare no competing interests.
Footnotes
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1.Šonka M, Hlaváč V, Boyle R. Image Processing, Analysis, and Machine Vision. Cengage Learning; 2014. [Google Scholar]
- 2.Glasbey C, Horgan G. Image Analysis for the Biological Sciences. Wiley; 1995. [Google Scholar]
- 3.Tunák M, et al. Estimation of fiber system orientation for nonwoven and nanofibrous layers: Local approach based on image analysis. Textile Res. J. 2014;88:989–1006. doi: 10.1177/0040517513509852. [DOI] [Google Scholar]
- 4.Chan T, Shen J. Image Processing and Analysis: Variational, PDE, Wavelet, and Stochastic Methods. Philadelphia: SIAM; 2005. [Google Scholar]
- 5.Ding J, Hu X, Gudivada V. A machine learning based framework for verification and validation of massive scale image data. IEEE Trans. Big Data. 2021;7:451–467. doi: 10.1109/TBDATA.2017.2680460. [DOI] [Google Scholar]
- 6.Liu B, et al. A spark-based parallel fuzzy -means segmentation algorithm for agricultural image big data. IEEE Access. 2019;7:42169–42180. doi: 10.1109/ACCESS.2019.2907573. [DOI] [Google Scholar]
- 7.Men K, et al. Fully automatic and robust segmentation of the clinical target volume for radiotherapy of breast cancer using big data and deep learning. Phys. Med. 2018;50:13–19. doi: 10.1016/j.ejmp.2018.05.006. [DOI] [PubMed] [Google Scholar]
- 8.Mola F, et al. Classification of images background subtraction in image segmentation. Acta Univ. Palackianae Olomucensis Math. 2016;55:73–86. [Google Scholar]
- 9.Appelhans M, et al. Phylogeny, evolutionary trends and classification of the Spathelia–Ptaeroxylon clade: Morphological and molecular insights. Ann. Bot. 2011;107:1259–1277. doi: 10.1093/aob/mcr076. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Frigau L, et al. A statistical approach to the morphological classification of Prunus sp. seeds. Plant Biosyst. 2020;154:877–886. doi: 10.1080/11263504.2019.1701126. [DOI] [Google Scholar]
- 11.Herridge R, et al. Rapid analysis of seed size in arabidopsis for mutant and QTL discovery. Plant Methods. 2011;7:3. doi: 10.1186/1746-4811-7-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Smykalova I, et al. Morpho-colorimetric traits of pisum seeds measured by an image analysis system. Seed Sci. Technol. 2011;39:612–626. doi: 10.15258/sst.2011.39.3.08. [DOI] [Google Scholar]
- 13.Piras F, et al. Effectiveness of a computer vision technique in the characterization of wild and farmed olives. Comput. Electron. Agric. 2016;122:86–93. doi: 10.1016/j.compag.2016.01.021. [DOI] [Google Scholar]
- 14.Bouby L, et al. Bioarchaeological insights into the process of domestication of grapevine (Vitis vinifera L.) during Roman times in southern France. PLoS ONE. 2013;8:e63195. doi: 10.1371/journal.pone.0063195. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Ucchesu M, et al. Predictive method for correct identification of archaeological charred grape seeds: Support for advances in knowledge of grape domestication process. PloS ONE. 2016;11:e0149814. doi: 10.1371/journal.pone.0149814. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Muñoz X, et al. Strategies for image segmentation combining region and boundary information. Pattern Recognit. Lett. 2003;24:375–392. doi: 10.1016/S0167-8655(02)00262-3. [DOI] [Google Scholar]
- 17.Yanowitz S, Bruckstein A. A new method for image segmentation. Comput. Vis. Graph. Image Process. 1989;46:82–95. doi: 10.1016/S0734-189X(89)80017-9. [DOI] [Google Scholar]
- 18.Mayer D, Butler D. Statistical validation. Ecol. Model. 1993;68:21–32. doi: 10.1016/0304-3800(93)90105-2. [DOI] [Google Scholar]
- 19.Kumar M, et al. Fuzzy theoretic model based analysis of image features. Inf. Sci. 2019;480:34–54. doi: 10.1016/j.ins.2018.12.024. [DOI] [Google Scholar]
- 20.Antoch J, Prchal L, Sarda P. Combining association measures for collocation extraction using clustering of receiver operating characteristic curves. J. Classif. 2013;30:100–123. doi: 10.1007/s00357-013-9123-x. [DOI] [Google Scholar]
- 21.Powers D. Evaluation: From precision, recall and f-measure to ROC, informedness, markedness & correlation. J. Mach. Learn. Technol. 2011;2:37–63. [Google Scholar]
- 22.Warfield S, Zou K, Wells W. Simultaneous truth and performance level estimation (STAPLE): An algorithm for the validation of image segmentation. IEEE Trans. Med. Imaging. 2004;23:903–921. doi: 10.1109/TMI.2004.828354. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Taha A, Hanbury A. Metrics for evaluating 3D medical image segmentation: Analysis, selection, and tool. BMC Med. Imaging. 2015;15:1–28. doi: 10.1186/s12880-015-0068-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Yushkevich P, et al. User-guided 3D active contour segmentation of anatomical structures: Significantly improved efficiency and reliability. NeuroImage. 2006;31:1116–1128. doi: 10.1016/j.neuroimage.2006.01.015. [DOI] [PubMed] [Google Scholar]
- 25.Zou K, et al. Statistical validation of image segmentation quality based on a spatial overlap index. Acad. Radiol. 2004;11:178–189. doi: 10.1016/S1076-6332(03)00671-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Bacchetta G, Grillo O, Mattana E, Venora G. Morpho-colorimetric characterization by image analysis to identify diaspores of wild plant species. Flora-Morphol. Distrib. Funct. Ecol. Plants. 2008;203:669–682. doi: 10.1016/j.flora.2007.11.004. [DOI] [Google Scholar]
- 27.Serra J. Image Analysis and Mathematical Morphology. Academic Press; 1982. [Google Scholar]
- 28.Sauvola J, Pietikäinen M. Adaptive document image binarization. Pattern Recognit. 2000;33:225–236. doi: 10.1016/S0031-3203(99)00055-2. [DOI] [Google Scholar]
- 29.Schindelin J, et al. FIJI: An open-source platform for biological-image analysis. Nat. Methods. 2012;9:676. doi: 10.1038/nmeth.2019. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Huang L, Wang M. Image thresholding by minimizing the measures of fuzziness. Pattern Recognit. 1995;28:41–51. doi: 10.1016/0031-3203(94)E0043-K. [DOI] [Google Scholar]
- 31.Prewitt J, Mendelsohn M. The analysis of cell images. Ann. N. Y. Acad. Sci. 1966;128:1035–1053. doi: 10.1111/j.1749-6632.1965.tb11715.x. [DOI] [PubMed] [Google Scholar]
- 32.Glasbey C. An analysis of histogram-based thresholding algorithms. CVGIP: Graph. Models Image Process. 1993;55:532–537. [Google Scholar]
- 33.Shanbhag A. Utilization of information measure as a means of image thresholding. CVGIP: Graph. Models Image Process. 1994;56:414–419. [Google Scholar]
- 34.Otsu N. A threshold selection method from gray-level histograms. IEEE Trans. Syst. Man Cybern. 1979;9:62–66. doi: 10.1109/TSMC.1979.4310076. [DOI] [Google Scholar]
- 35.Kapur J, Sahoo P, Wong A. A new method for gray-level picture thresholding using the entropy of the histogram. Comput. Vis. Graph. Image Process. 1985;29:273–285. doi: 10.1016/0734-189X(85)90125-2. [DOI] [Google Scholar]
- 36.Doyle W. Operations useful for similarity-invariant pattern recognition. J. ACM. 1962;9:259–267. doi: 10.1145/321119.321123. [DOI] [Google Scholar]
- 37.Zack G, Rogers W, Latt S. Automatic measurement of sister chromatid exchange frequency. J. Histochem. Cytochem. 1977;25:741–753. doi: 10.1177/25.7.70454. [DOI] [PubMed] [Google Scholar]
- 38.Tsai W-H, et al. Moment preserving thresholding. A new approach. Comput. Vis. Graph. Image Process. 1985;29:377–393. doi: 10.1016/0734-189X(85)90133-1. [DOI] [Google Scholar]
- 39.Breiman L, et al. Classification and Regression Trees. Chapman & Hall; 1984. [Google Scholar]
- 40.Emond E, Mason D. A new rank correlation coefficient with application to the consensus ranking problem. J. Multi-criteria Decis. Anal. 2002;11:17–28. doi: 10.1002/mcda.313. [DOI] [Google Scholar]
- 41.Kendall MG. Rank Correlation Methods. Griffin; 1948. [Google Scholar]