PLOS ONE
. 2020 Oct 14;15(10):e0238835. doi: 10.1371/journal.pone.0238835

Analyzing the fine structure of distributions

Michael C Thrun 1,2,*, Tino Gehlert 3, Alfred Ultsch 1
Editor: Fatemeh Vafaee 4
PMCID: PMC7556505  PMID: 33052923

Abstract

One aim of data mining is the identification of interesting structures in data. For better analytical results, the basic properties of an empirical distribution, such as skewness and eventual clipping, i.e., hard limits in value ranges, need to be assessed. Of particular interest is the question of whether the data originate from one process or contain subsets related to different states of the data-producing process. Data visualization tools should deliver a clear picture of the univariate probability density function (PDF) for each feature. Visualization tools for PDFs typically use kernel density estimates and include the classical histogram as well as modern tools such as ridgeline plots, bean plots and violin plots. If the density estimation parameters remain at their default settings, conventional methods pose several problems when visualizing the PDF of uniform, multimodal and skewed distributions and of distributions with clipped data. For this reason, a new visualization tool called the mirrored density plot (MD plot), which is specifically designed to discover interesting structures in continuous features, is proposed. The MD plot does not require adjusting any parameters of density estimation, which may make it particularly compelling to non-experts. The visualization tools in question are evaluated against statistical tests with regard to typical challenges of exploratory distribution analysis. The results of the evaluation are presented using bimodal Gaussian and skewed distributions and several features with already published PDFs. In an exploratory data analysis of 12 features describing quarterly financial statements, for which statistical testing is very difficult, only the MD plot can identify the structure of their PDFs. In sum, the MD plot outperforms the above-mentioned methods.

Introduction

In exploratory distribution analysis, it is essential to investigate the structures of continuous features and to ensure that such investigations do not mislead researchers into false assumptions. Given one feature in the data space, several approaches are available for evaluating univariate structures using indications of the quantity and range of values. These approaches include quantile-quantile plots [1, 2], histograms, cumulative density functions, and probability density functions (PDFs). When the goal is to evaluate many features simultaneously, four approaches are of particular interest: the Box-Whisker diagram (box plot) [3], the violin plot [4], the bean plot [5] and the ridgeline plot [6]. Since the box plot and its counterpart, the range bar [7], together with its extension, the notched box plot [8], are nearly unable to visualize multimodality [3], they are disregarded in this work. The violin plot, in contrast, as suggested by its name, was specifically intended to identify multimodality by exposing the waist between two modes of a distribution.

In exploratory statistics, univariate density estimation is a challenging task, especially for non-experts in the field. In fact, changing the default parameters of the available software, such as the bandwidth and the kernel of the density estimator, can lead not only to better results but also to worse ones with the abovementioned methods. Moreover, both in a strictly exploratory setting and when evaluating quality measures for supervised or unsupervised machine learning methods, it is difficult to set those parameters without a prior model of the data or results of the evaluation. Hence, non-experts typically use the default option. On the one hand, it is challenging to consider the intrinsic assumptions of common density estimation approaches, which leads non-experts to opt for the most common methods in their default settings. On the other hand, “wisely used, graphical representations can be extremely effective in making large amounts of certain kinds of numerical information rapidly available to people” [9], p. 375.

When the default parameter settings are used, the schematic plots of violin plots, bean plots, ridgeline plots and histograms provide misleading visualizations, as will be illustrated for several bodies of data. Thus, it is necessary to develop a new graphical tool that enables a better understanding of the data at hand. This work proposes a strictly data-driven schematic plot, called the mirrored density plot (MD plot), based on Pareto density estimation (PDE). The PDE approach is particularly suitable for detecting structures in continuous data, and its kernel density estimation does not require any parameters to be set. The MD plot is compared with conventional methods, namely violin plots, bean plots, ridgeline plots and histograms. This work will show that, for multimodal or skewed distributions, the MD plot investigates distributions of data with more sensitivity than conventional methods. Statistical testing will be used as an indicator of the sensitivity of all the methods with respect to skewness and multimodality. For exploratory data analysis in a high-dimensional case, descriptive statistics will be used to show that the bean plot, unlike the MD plot, gives misleading visualizations.

Methods

The methods section is divided into three parts. First, we outline how the performance of visualization tools is investigated. The focus of interest in this work lies in a separate visualization of basic properties of the empirical distribution of each feature, which means that our interest is restricted to univariate density estimation and visualizations that can present more than one feature in one plot. Such approaches are usually called schematic plots. The best-known representative is the box-whisker diagram (box plot) [3]. However, box plots are unable to visualize multimodality (e.g., [10]) and are therefore not investigated herein. In the second section, we introduce and compare the visualization tools. In the last section, we introduce the MD plot.

Performance comparison

In this work, three steps of comparison are applied. First, artificial features are generated by specifically defined sampling approaches. Thus, the basic properties of the investigated distributions are well defined, as long as the sample size is not too small. For the artificial datasets in the case of skewness and bimodality, samples are chosen at the maximum size allowable for exact statistical testing. On the other hand, the minimum size is chosen for the artificial dataset of the uniform distribution for which a QQ plot against the uniform distribution would indicate a straight line. The sample sizes investigated here for natural and artificial datasets range from 269 to 31,000. The implicit assumption of this work is that within this range of sample sizes, it is improbable that the results of the compared methods will change. In the case of the MD plot, the underlying Pareto density estimation is well investigated for varying sample sizes [11]. To account for variance in sampling, we perform 100 iterations of sampling and test the artificial datasets for multimodality and skewness before visualizing them with schematic plots.

The sensitivity for multimodality is compared with Hartigan’s dip statistic [12] because it has the highest sensitivity in distinguishing unimodality from nonunimodality compared to other approaches [13]. For skewness, the D'Agostino test of skewness [14] is used to distinguish skewed distributions from normal distributions. In the next step, natural features whose basic properties are already known are selected. The first and second steps outline the challenges the conventional methods face.
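
The D'Agostino test of skewness is available in SciPy as `scipy.stats.skewtest`; Hartigan's dip test is not part of SciPy, although implementations exist, e.g., in the `diptest` package on PyPI. The following sketch, with arbitrary illustrative sample choices rather than the exact samples of the experiments below, shows the expected behavior of the skewness test on a symmetric and a strongly skewed sample:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Symmetric sample: the test should not indicate skewness.
symmetric = rng.standard_normal(15_000)
z_sym, p_sym = stats.skewtest(symmetric)

# Log-normal sample: strongly right-skewed, the test rejects clearly.
asymmetric = rng.lognormal(size=15_000)
z_skew, p_skew = stats.skewtest(asymmetric)

print(p_sym, p_skew)
```

The sample size of 15,000 mirrors the order of magnitude used in the experiments of this work.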

In the last step, we exploratively investigate a new dataset containing several features with unknown basic properties to summarize the problems of visualizing the estimated probability density function. In such a typical data mining setting, it would be very challenging to adjust the parameters of conventional visualization tools. For example, when visualizing high-dimensional data, one is unable to set the parameters of a method correctly because the appropriate adjustments are unknown (e.g., p. 42, Fig 5.2 in [15]), or one sets the parameters for a specific dataset correctly because the option is known beforehand [16]. However, such an option automatically becomes inappropriate for a dataset with dissimilar properties (e.g., p. 8, Fig 7 in [16]; see also the example in S6 File, section 4). In sum, the right choice of parameters is interrelated with properties of the data that are unknown in an unsupervised or exploratory data mining task. Table 1 summarizes the interesting basic properties from the perspective of data mining and the methods used to compare the performance of the different visualization tools. Extensive knowledge discovery for this dataset was performed in [17]. Therefore, we compare the visualizations with basic descriptive statistics and show which visualization tools do not visualize the shapes of the PDF accurately without changing the default parameters of the investigated visualization methods.

Table 1. Summary of basic properties of empirical distributions that are interesting for data mining.

Interesting basic property | Exemplary data mining applications | Statistical test used | Descriptive statistic
Uniformity versus multimodality | Biomedical data [22], water vapor [23] | Hartigan’s dip test [12] | Difference between mean and median can indicate multimodality; several coefficients [23]
Data clipping versus heavy-tailedness | Flood data [24], upper income [25] | Not required here, but we can refer to [24, 26] | Range of data is sufficient for the task. “There is no easy characteristic for heavy-tailedness” [27]
Skewness versus normality | Biomedical data [28], strength of glass fibers & market value growth [29] | D'Agostino test [14] | Third-order statistics, for example [28]

Comparing visualizations is challenging because they share a problem with the estimation of quantiles and with clustering algorithms such as k-means or Ward: they depend on the specific implementation (cf. [18–21]). Therefore, this work restricts the comparison to several conventional methods and specifies the programming language, package and PDF estimation approach used, in order to outline several relevant problems for the visualization of the basic properties of the PDF. To ensure that the MD plot introduced herein does not depend on a specific implementation, we use two different programming languages (R and Python), and the results from R presented herein are reproduced in the Python tutorial attached to this work.

Visualization tools

Usually, univariate density estimation is based on finite mixture models, variable kernel estimates or uniform kernel estimates [11]. Finite mixture models attempt to find a superposition of parameterized functions, typically Gaussians, that best accounts for the data [30]. In the case of kernel-based approaches, the actual probability density function is estimated using local approximations [30]: the local approximations are parameterized in such a way that only data points within a certain distance of a selected point influence the shape of the kernel function; this distance is called the (band-)width or radius of the kernel [30]. Variable kernel methods can adjust the radius of the kernel, whereas uniform kernel algorithms use a fixed global radius [30]. Histograms use a fixed global radius to define the width of a bin (binwidth). The binwidth parameter is critical for the visualized basic properties of the PDF, and in this work, only the default parameter will be used, for the reason that non-experts might not adjust the parameters on their own. However, approaches are available for a more elaborate choice depending on the intrinsic assumptions about the data (e.g., [31]). As an example, we use the histograms of plotly [32], which can be used in R, MATLAB or Python. This work concentrates on visualizing the estimated probability density function (PDF), which will be called the distribution of the feature (variable).
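
The role of a fixed global radius can be illustrated with SciPy's `gaussian_kde`, which by default chooses one global bandwidth via Scott's rule, n^(-1/(d+4)). This sketch is illustrative only; it is not the estimator used by any of the R packages discussed here:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
data = rng.standard_normal(1_000)

# One smoothing radius for the whole sample: Scott's rule by default.
kde = gaussian_kde(data)
# For 1-D data, Scott's factor reduces to n**(-1/5).
print(kde.factor, 1_000 ** (-0.2))

# Evaluating the estimate on a grid yields the visualized PDF curve.
grid = np.linspace(-4, 4, 200)
density = kde(grid)
```

Because the radius is global, the same amount of smoothing is applied in dense and sparse regions alike, which is one source of the artifacts discussed in the experiments below.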

The first variant visualizing the PDF was the vase plot [33], in which the box of a box plot is replaced by a symmetrical display of estimated density [10]. The box plot itself visualizes only a statistical summary of a feature. A further amendment was the violin plot, which mirrors an estimated PDF so that the visualization looks similar to a box plot. “The bean plot [5] is a further enhancement that adds a rug that is showing every value and a line that shows the mean. The appearance of the plot inspires the name: the shape of the density looks like the outside of a bean pod, and the rug plot looks like the seeds within” [10].

The violin plot [4] uses a nonparametric density estimation based on a smooth kernel function with a fixed global radius [34]. The R package ‘vioplot’ on CRAN [35] serves as its representative in this work and uses the density estimation, with the bandwidth defined by a Gaussian variance, of the R package 'sm' on CRAN [36]. Another commonly applied method uses the density estimation of the R package 'stats' [37], where the bandwidth is usually computed by estimating the mean integrated square error [38]; nevertheless, several other approaches can be chosen as well.

An alternative to the “vioplot” is the geom_violin method [4] of the well-known “ggplot2” package [39] presented in S6 File, which uses the density estimation specified in [37]. In contrast to the violin plots, the bean plot in the R package ‘beanplot’ on CRAN [5] redefines the bandwidth [40]. As noted by Bowman and Azzalini, the density estimation critically depends on the choice of the width of the kernel function [34].

Yet another approach is the ridgeline plot. “Ridgeline plots are partially overlapping line plots that create the impression of a mountain range” [41]. In R, they are available in the ggridges package on CRAN [41] and either use the density estimation approaches of R discussed above (if set manually) or the default setting, which “estimates the data range and bandwidth for the density estimation from the entire data at once, rather than from each individual group of data” [41]. The default setting is used in this work.

One of the most common ways to create a violin plot in Python is to use the visualization package ‘seaborn’ [42], which extends the Python package ‘matplotlib’ with statistical plots such as the violin plot. Seaborn uses Gaussian kernels for kernel density estimation from the Python package ‘scipy’ [43], where the bandwidth is set to Scott’s rule by default (see https://github.com/scipy/scipy/blob/v1.3.0/scipy/stats/kde.py#L43-L637) [30]. The density plots and ridgeline plots in Python, presented in supplementary E, are created using the ‘kdeplot’ function of the ‘seaborn’ package. This approach uses the density estimation by Racine [44] implemented in the ‘statsmodels’ package [45] if it is installed; otherwise, the density estimation of ‘scipy’ is used.

Mirrored Density plot (MD plot)

A special case of uniform kernel estimates is the density estimation using the number of points within a hypersphere of a fixed radius around each given data point. In this case, the number of points within a hypersphere of each data point is used for the density estimation at the center of the hypersphere. In “Pareto density estimation (PDE), the radius for hypersphere density estimation is chosen optimally [with respect to] information theoretic ideas” [11]. Information optimization calls for a radius that enables the hyperspheres to contain a maximum of information using minimal volume [11]. If a hypersphere contains approximately 20% of the data on average, it is the source of more than 80% of the possible information any subset of data can have [11]. PDE is particularly suitable for the discovery of structures in continuous data and allows for the discovery of mixtures of Gaussians [22].
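
A simplified one-dimensional sketch of hypersphere density estimation may clarify the idea. Note that the heuristic below chooses the radius from the ~20% target stated above, whereas the published PDE derives the Pareto radius information-theoretically [11], so the function is illustrative only:

```python
import numpy as np

def pde_sketch(x, target_fraction=0.20):
    """Simplified hypersphere density estimation for 1-D data.

    The global radius is chosen so that, on average, the interval
    around each point contains roughly 20% of the data (heuristic
    stand-in for the information-theoretic Pareto radius of [11]).
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    k = max(1, int(target_fraction * n))
    # pairwise distances; each sorted row gives nearest-neighbour distances
    d = np.abs(x[:, None] - x[None, :])
    d.sort(axis=1)
    radius = np.median(d[:, k])            # median k-th neighbour distance
    counts = (d <= radius).sum(axis=1)     # points inside each "sphere"
    density = counts / (n * 2 * radius)    # normalize to a density estimate
    return radius, density

rng = np.random.default_rng(6)
radius, density = pde_sketch(rng.uniform(0.0, 1.0, size=500))
```

For a uniform sample on [0, 1], the estimated density is close to 1 in the interior, as expected of a parameter-free estimator of this kind.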

For this work, the general idea of mirroring the PDF in a visualization is combined with the PDE approach to density estimation, resulting in the mirrored density plot (MD plot). Using the theoretical insights of [11] for the Pareto radius and [31] for the number of kernels, the PDE algorithm is implemented in the package ‘DataVisualizations’ on CRAN [46] and independently implemented in Python [47]. To provide an easy-to-use method for non-experts, the MD plot allows for an investigation of the distributions of many features (variables) after common transformations (symmetric log, robust normalization [48], percentage), with automatic sampling in the case of large datasets and several statistical tests for normal distributions. If all tests agree that a feature is Gaussian distributed, then the plot of the feature is automatically overlaid with a normal distribution whose robustly estimated mean and variance equal those of the data. This overlay marks possibly non-Gaussian distributions for follow-up single-feature investigation with a quantile-quantile plot in cases where statistical testing may be insensitive. In the default mode, the features are ordered by convex, concave, unimodal, and nonunimodal “distribution shapes”.
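
The mirroring itself is straightforward to sketch: the estimated density is reflected around the category position so that the outline is symmetric. The helper below is a hypothetical illustration that substitutes SciPy's Gaussian KDE for the Pareto density estimation actually used by the MD plot:

```python
import numpy as np
from scipy.stats import gaussian_kde

def mirrored_outline(values, at=0.0, width=0.4, grid_size=100):
    """Coordinates of a mirrored-density outline for one feature.

    Sketch only: uses a Gaussian KDE in place of the Pareto density
    estimation on which the MD plot is actually based.
    """
    kde = gaussian_kde(values)
    y = np.linspace(values.min(), values.max(), grid_size)
    d = kde(y)
    d = width * d / d.max()                 # scale the half-width
    # left half and (reversed) right half, mirrored around `at`
    xs = np.concatenate([at - d, (at + d)[::-1]])
    ys = np.concatenate([y, y[::-1]])
    return xs, ys

rng = np.random.default_rng(2)
xs, ys = mirrored_outline(rng.standard_normal(1_000))
```

The resulting (xs, ys) polygon is symmetric around the category position, which is the visual signature shared by violin plots and MD plots.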

The MD plot performs no density estimation below a threshold defining the minimum number of unique values in the data. Instead, a 1D scatter plot (rug plot) is visualized in which, for each unique value, the points are jittered on the horizontal axis to indicate the number of points per unique value. A second threshold defines the minimum total number of values in the data below which a 1D scatter plot is likewise presented instead of a density estimation. The default settings of both thresholds can be changed or disabled by the user if necessary. These thresholds are advantageous in the case of a varying number of missing values per feature or if the benchmarking of algorithms yields quantized error states in specific cases (S6 File, section 5).
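
This threshold logic can be sketched as a small decision function; the default values used here (`min_unique`, `min_n`) are illustrative placeholders, not the defaults of the MD plot implementation:

```python
def choose_panel(values, min_unique=12, min_n=25):
    """Decide between a density estimate and a 1-D scatter (rug) panel.

    Illustrative thresholds only; the MD plot packages define their own
    defaults and allow the user to change or disable them.
    """
    values = [v for v in values if v is not None]  # ignore missing values
    if len(values) < min_n or len(set(values)) < min_unique:
        return "scatter"   # too few (unique) values for density estimation
    return "density"
```

Features with many missing values or quantized error states thus fall back to a scatter panel instead of a misleading density curve.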

The MD plot can be applied by installing the R package ‘DataVisualizations’ on CRAN [46], which works in the framework of ggplot2 [39]. The Python implementation of the MD plot is provided in the Python package ‘md_plot’ on PyPI [47]. The vignettes describing the usage and providing the data are attached to this work for the two most common data science programming languages, namely, Python and R. In the next section, the visual performance in indicating the correct distribution of features is investigated for a ridgeline plot, a violin plot, a bean plot and a histogram and compared against an MD plot.

Results

Initially, a random sample of 1,000 points of a uniform distribution was drawn and visualized by commonly used ridgeline, violin, bean and MD plots (Fig 1) and a histogram (Fig A in S4 File), as well as by the corresponding methods in Python (Fig A in S3 File, Fig A in S5 File). In PDF visualizations of a uniform distribution, a straight line is expected, with possible minor fluctuations depending on the random number generator used (range [-2,2], generated with R 3.5.1, runif function). Contrary to expectations, the ridgeline plot, histogram and bean plot indicate multimodality, and the bean plot, ridgeline plot and violin plot bend the PDF line toward the end points. The visualization of this sample in Python with the package ‘seaborn’ [42] shows a tendency towards multimodality (Fig A in S3 File). Hartigan’s dip test [12] and the D'Agostino test of skewness [14] yield p(N = 1,000, D = 0.01215) = 0.44 and p(N = 1,000, z = 0.59) = 0.55, respectively, indicating that this sample is unimodal and not skewed.

Fig 1.

Fig 1

Uniform distribution in the interval [−2,2] of a 1,000-point sample visualized by a ridgeline plot (a) of ggridges on CRAN [41] (top left) and violin plot (b, top right); bottom: bean plot (d, right) and MD plot (c, left). In the ridgeline, violin and bean plots, the borders of the uniform distribution are bent, contrary to the actual number of values around the borders −2 and 2. The bean plot and ridgeline plot indicate multimodality, but Hartigan’s dip statistic [12] disagrees: p(n = 1,000, D = 0.01215) = 0.44.

As a consequence, several experiments and one exploratory investigation of a high-dimensional dataset are performed. The first two experiments investigate the multimodality and skewness of the data. The third experiment investigates the clipping of data, which is often used in data science. The fourth experiment uses a well-investigated clipped feature that is log-normal distributed and possesses several modes [49]. In the exploratory investigation, descriptive statistics in a high-dimensional case are used to outline major differences between the bean plot and the MD plot. In the last experiment, the effect of the range of values on the schematic plots is outlined.

Experiment I: Multimodality versus unimodality

Two Gaussians, where the mean of one is changed, were used to investigate the sensitivity for bimodality in the ridgeline, violin, bean and MD plots and the histogram (Fig B in S4 File), as well as in Python (Fig B in S5 File). For each Gaussian, randomized samples of 15,500 points were drawn. The sample consisting of the first Gaussian N(m = 0, s = 1) remained unchanged (for the definition of Gaussian mixtures, please see [50]), and the second Gaussian N(m = i, s = 1) changed its mean through a range of values. Thus, the distance between the two modes of the Gaussian mixture varies with each change of the mean of the second Gaussian. For statistical testing with Hartigan’s dip test, 100 iterations were performed to take the variance of the random number generators and statistical method into account. Fig 2 shows that starting with a mean of 2.4, a significant p-value of approximately 0.05 is probable, and starting with a mean of 2.5, every p-value will be below 0.01.
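
The sampling scheme of this experiment can be sketched as follows (the dip test itself is not reproduced, since it is not part of SciPy; an implementation is available, e.g., in the `diptest` package on PyPI):

```python
import numpy as np

rng = np.random.default_rng(3)

def bimodal_sample(mean2, n_per_mode=15_500):
    """Equally sized Gaussians N(0, 1) and N(mean2, 1), as in Experiment I."""
    return np.concatenate([rng.standard_normal(n_per_mode),
                           rng.standard_normal(n_per_mode) + mean2])

# One draw of the mixture at the separation where the dip test
# becomes reliably significant according to Fig 2.
sample = bimodal_sample(2.5)
```

Repeating this draw 100 times per value of `mean2` and testing each sample reproduces the Monte Carlo setup of Fig 2.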

Fig 2. Scatterplots of a Monte Carlo simulation in which samples were drawn and testing was performed in a given range of parameters in 100 iterations.

Fig 2

The visualization is restricted to the median and the 99th percentile of the p-values for each x value. The test of Hartigan’s dip statistic is highly significant for a mean higher than 2.4 in a sample of size n = 31,000.

This result is visualized in Fig 3. The bimodality is visible in the ridgeline plot, bean plot and MD plot starting with a mean equal to 2.4. Additionally, a robustly estimated Gaussian in magenta is overlaid on the MD plot, making the bimodality visible starting with a mean of 2.2. The Hartigan dip statistic [12] is consistent with these schematic plots. In contrast, the violin plots examined here, except for geom_violin of ggplot2 (see S6 File), do not show a bimodal distribution (Fig 3), while the Python violin plots and ridgeline plots show the bimodality starting with a mean equal to 2.4 (Fig B in S3 File, Fig B in S5 File). Histograms are less sensitive, showing a bimodal distribution only beginning with a mean of 2.5.

Fig 3.

Fig 3

Plots of the bimodal distribution with changing mean of the second Gaussian: ridgeline plots (a) of ggridges on CRAN [41], violin plot (b), bean plot (c), and MD plot (d). Bimodality is visible beginning with a mean of 2.4 in the bean plot, ridgeline plot and MD plot, but the MD plot draws a robustly estimated Gaussian (magenta) if statistical testing is not significant, which indicates that the distributions are not unimodal at a mean of two. The bimodality of the distribution is not visible in the violin plot [4] of the implementation [34].

Experiment II: Skewness versus normality

Next, an artificial feature of a skewed normal distribution is generated by the sampling method of the R package ‘fGarch’ available on CRAN [51]. For the skewed Gaussian, large randomized samples of 15,000 points were drawn for each value of the skewness parameter. The case of N(m = 0, s = 1, xi = 1) defines the symmetric Gaussian distribution (for the definition of the Gaussian, please see [50]; for skewed distributions, [51]). One hundred iterations were performed, and the D'Agostino test of skewness [14] revealed no significant results for skewness in the range [0.95, 1.05] in Fig 4. Skewness is visible in the bean plot and MD plot (Fig 5) but not in the violin plot. Unlike in the R version, the skewness is visible in the Python version of the violin plot (Fig C in S3 File, Fig C in S5 File), which is, however, slightly less sensitive than the bean plot and MD plot. In the histogram, the skewness of the distribution is difficult to recognize (Fig C in S4 File). The bean plot and MD plot are slightly less sensitive to skewed distributions than statistical testing (Fig 4).
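
In Python, a comparable experiment can be sketched with `scipy.stats.skewnorm`. Note that its shape parameter `a` differs from the `xi` skewness parameterization of fGarch, so this is an analogous, not identical, sampler:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# a = 0 recovers the standard normal; a != 0 introduces skewness.
symmetric = stats.skewnorm.rvs(a=0, size=15_000, random_state=rng)
skewed = stats.skewnorm.rvs(a=5, size=15_000, random_state=rng)

# D'Agostino's test, as used for Fig 4.
_, p_symmetric = stats.skewtest(symmetric)
_, p_skewed = stats.skewtest(skewed)
```

At n = 15,000 the test rejects clearly for the skewed sample while the symmetric sample passes, matching the behavior reported for the statistical-testing baseline.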

Fig 4. Scatterplots of a Monte Carlo simulation in which samples were drawn and testing was performed in a given range of parameters in 100 iterations.

Fig 4

The visualization is restricted to the median and the 99th percentile of the p-values for each x value. The D'Agostino test of skewness [14] was highly significant for skewness outside of the range [0.95, 1.05] in a sample of n = 15,000. Scatter plots were generated with plotly [32].

Fig 5.

Fig 5

Plots of the skewed normal distribution with different skewness using the R package fGarch [51] on CRAN: ridgeline plots (a) of ggridges on CRAN [41], violin plot (b), bean plot (c) and MD plot (d). The sample size is n = 15,000. The violin plot is less sensitive to the skewness of the distribution. The MD plot allows for easier detection of skewness by ordering the columns automatically.

Experiment III: Data clipping versus heavy-tailedness

The municipality income tax yield (MTY) of German municipalities in 2015 [46, 52] serves as an example of data clipping, for which the comparison will be restricted to bean plots and MD plots. MTY is unimodal; Hartigan’s dip statistic is consistent with this assessment, p(n = 11,194, D = 0.0020678) = 0.99 [12]. The bean plot has a major limitation for clipped data: Fig 6 shows that it estimates nonexistent distribution tails and visualizes a density above and below the clipping range [1800, 6000]. This issue can also be observed in the Python violin and density plots (Fig D in S3 File, Fig D in S5 File).
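
The mechanism behind this artifact can be reproduced with synthetic data: clipping concentrates mass at the hard limits, and a smooth fixed-bandwidth kernel spreads that mass past them. The numbers below are hypothetical and only mimic the clipped MTY feature, not the actual dataset:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(5)
raw = rng.normal(4_000, 900, size=11_000)
clipped = np.clip(raw, 1_800, 6_000)   # hard limits, as in the MTY feature

kde = gaussian_kde(clipped)
# The Gaussian kernel assigns positive density beyond the upper limit,
# although no clipped value can exceed 6000.
leak = kde(np.array([6_300.0]))[0]
print(leak)
```

A uniform-kernel estimator with a data-driven radius, such as the PDE used by the MD plot, does not smear density across such hard limits in the same way.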

Fig 6. MTY feature clipped in the range marked in red with a robustly estimated average of the whole data in magenta (left) and not clipped (right).

Fig 6

The bean plot (a) underestimates the density toward the clipped range [1800, 6000] and draws a density outside of the range of values. Additionally, this leads to the misleading interpretation that the average lies at 4000 instead of 4300. The MD plot (b) visualizes the density independently of the clipping. Note that for a better comparison, we disabled the additional overlaying plots.

Experiment IV: Combining multimodality and skewness with data clipping

Here, one feature is used to compare the histogram and the schematic plots against each other. The feature is the income of the German population in 2003 [49]. The whole feature was modeled with a Gaussian mixture model on the log scale and verified with the chi-square test (p < .001) and a QQ plot [49]. A sample of 500 cases was taken, and the PDF of the sample was skewed on the log scale in accordance with the D'Agostino skewness test (skew = -1.73, p-value p(N = 500, z = -22.4) < 2.2e-16 [14]).

In Fig 8, it is visible that the violin plot, contrary to the MD plot, underestimates the skewness of the distribution. In addition, the violin, ridgeline and bean plots show a mode between 4 and 4.5 in the skewed distribution (Figs 7 and 8). In Fig D of S4 File, the histogram is consistent with the MD plot and inconsistent with the bean plot, indicating that there are no values above 4.35; this means that the ridgeline and bean plots visualize a PDF above the maximum value (marked with red lines). Thus, similar to experiment III, the bean plot incorrectly visualizes a density above the maximum possible value of 4.35, with a strong tendency to underestimate the density toward the maximum value, whereas the MD plot estimates the density correctly (cf. visualizations in [46]). Similar to the bean plot, the Python density function and violin plot show values above 4.35; however, they smooth the distribution more (Fig E in S3 File, Fig E in S5 File), and hence these plots do not indicate multimodality.

Fig 8.

Fig 8

Distribution analyses performed on the log of the German population’s income in 2003 with the violin plot (b), bean plot (a) and MD plot (c). The bean plot and violin plot visualize an additional mode in the range of 4–4.5. The bean plot visualizes a PDF above the maximum value (red line). The multimodality of ITS is not visible with the default binwidth. Only the MD plot visualizes a clearly clipped and skewed multimodal distribution. Note that for a better comparison, we disabled the additional overlaying plots.

Fig 7.

Fig 7

Distribution analyses performed on the log of the German population’s income in 2003 with ridgeline plots (a) of ggridges on CRAN [41] do not indicate clipping or multimodality.

Experiment V: Visual exploration of distributions

The high-dimensional dataset (d = 45) of quarterly statements of companies listed on the German stock market is investigated by selecting 12 example features. It should be noted that the other features show similar effects, but more features would make this example harder to understand. In line with the prime standard of “Deutsche Börse” [53], these companies are required to report their balance and cash flow regularly every three months in a standardized way; the reports are then accessible in [54]. Using web scraping, the information of n = 269 cases was extracted. In such a high-dimensional case, statistical testing, parameter settings, usual density plots and histograms become very troublesome and thus are omitted in this work. Moreover, integrating different ranges in one visualization also poses a challenge. In Table A in S2 File, the order of the descriptive statistics of the features from top to bottom is the same as in the MD plot, ridgeline plot and bean plot from left to right (Fig 9). The MD plot enables a concave ordering, which is used here. The MD plot (Fig 9), the bean plot (Fig 10A) and the ridgeline plot (Fig 10B) visualize all features in one picture. Table A in S2 File shows that the six rightmost features do not possess more than 1% negative values. Fifty percent of the data for “net tangible assets” and “total cash flow from operating activities” lie in a small positive range. “Interest expense” and “capital expenditures” do not have more than 1% positive values. “Net income” has only 25% of its data below zero, and “treasury stock” has the second largest kurtosis of the selected features.

Fig 9. MD plots of selected features from 269 companies on the German stock market reporting quarterly financial statements by the prime standard.

Fig 9

The features are concave ordered and the same as in Fig 10 and Table A in S2 File. For 8 out of 12 distributions, there is a hard cut at the value zero, which agrees with Table A in S2 File. The features are highly skewed, except for net tangible assets, total assets, and total stockholder equity; the latter two are multimodal.

Fig 10.

Fig 10

Bean plots of selected features from 269 companies on the German stock market reporting quarterly financial statements by the prime standard (top, a) and ridgeline plots (b, bottom) of ggridges on CRAN [41]. The features are concave ordered and the same as in Fig 9. There is no hard cut around the value zero (red line), and the features appear unimodal or uniform with a large variance and a small skewness. The visualizations disagree with the descriptive statistics in Table A in S2 File. Note that for a better comparison, we disabled the additional overlaying plots in the bean plots.

The MD plot shows that “net income”, “treasury stock” and “total cash flow from operating activities” have a high kurtosis in a small range of data centered around zero (Fig 9). “Interest expenses” and “capital expenditures” are highly negatively skewed. The six rightmost features do not possess visible negative values.

The bean plot turns skewed distributions into unimodal or uniform ones (Fig 10A). In the bean plot and the ridgeline plot (Fig 10B), there are no hard cuts around zero (red line). Instead, approximately one-third or more of each visualized distribution lies below zero, contrary to the descriptive statistics, by which six features cannot have more than 1% of values below zero. In sum, the visualization of the MD plot is consistent with the descriptive statistics (Table A in S2 File) and inconsistent with the bean plot and ridgeline plot. The Python violin and ridgeline plots show values above and below the limits of [-250000, 1000000] as well as less detailed and incorrectly unimodal distributions (Fig F in S3 File, Fig F in S5 File).

Experiment VI: Range of values depending on features

In a dataset, the ranges of features often differ. For example, the ranges of MTY and ITS (Income Tax Share, [46, 52]) vary widely, and the usual schematic plot cannot show the distributions of both features simultaneously, as the MD plot in Fig 11 illustrates. With the option of robust normalization [48] selected in the MD plot, the distributions can be investigated at once without changing their basic properties (Fig 11). As a result, the bimodality of the ITS feature becomes visible in the MD plot and in the bean plot (Fig A in S2 File). The violin plot, however, is unable to visualize the bimodal distribution, and the overlayed histogram underestimates it significantly (Fig E in S4 File). The Python density and violin plots draw data above and below the limits of the data but show the bimodality of the ITS feature (Fig A in S4 File, Fig G in S5 File). Statistical testing confirms that the distribution of ITS is not unimodal (n = 11194, D = 0.01196, p < 2.2e-16).
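
The robust normalization referenced above can be sketched as follows. This is a hedged illustration only: percentile-based range limits stand in for the exact formula of [48] as implemented in the DataVisualizations package, which may differ in detail:

```python
import numpy as np

def robust_normalization(x, lower_q=1, upper_q=99):
    """Scale a feature into roughly [0, 1] using robust range limits.

    Sketch only: percentile-based limits stand in for the exact robust
    normalization of Milligan & Cooper [48] / DataVisualizations.
    """
    x = np.asarray(x, dtype=float)
    lo, hi = np.percentile(x, [lower_q, upper_q])
    if hi == lo:  # constant feature: nothing to scale
        return np.zeros_like(x)
    return (x - lo) / (hi - lo)

# Two features with widely differing ranges become directly comparable:
mty = np.random.default_rng(1).normal(5, 1, 1000)          # small range
its = np.random.default_rng(2).normal(50000, 8000, 1000)   # large range
mty_n, its_n = robust_normalization(mty), robust_normalization(its)
```

After this transformation, both features occupy approximately the same value range, so their distributions can be drawn in one plot without one of them degenerating to a line.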

Fig 11.


If the ranges of features vary widely, visualizing even two distributions at once fails, as shown for the MD plot (a). However, the MD plot enables the user to apply simple transformations so that several distributions can be visualized at once even if the ranges vary (b).

Discussion

If a simultaneous explorative distribution analysis of several features is required, the interesting basic properties of empirical distributions are depicted in Table 1: skewness, multimodality, normality, uniformity, data clipping, and the visualization of the varying ranges between features.

Usually, density estimation and visualization approaches are investigated independently of each other. Instead, the authors combine the issue of density estimation with visualization, following the perspective of Tufte, Wilk and Tukey that a graphical representation itself can be used as an instrument for reasoning about quantitative information [9, 55] (p. 53). The results show that the MD plot is the only schematic plot that is appropriate for every case and whose process of density estimation does not require adjustments of various parameters.

Three artificial and four natural datasets show the limitations of the ridgeline, bean, and violin plots (in their R and Python versions). A comparison of results with conventional statistical testing and histograms is included. The results illustrate that the usefulness of the ridgeline, violin and bean plots depends on the density estimation approach used in the algorithm, which in turn depends critically on the bandwidth of the kernel function.

For an artificial distribution of two equally sized Gaussians and a skewed Gaussian, statistical testing was performed with the dip statistic by changing the mean of the second Gaussian, and with the D'Agostino test of skewness [14] by changing the skewness parameter (sample size n = 15000). The minimal quality requirement for schematic plots is that the visualizations must at least produce results comparable to (“be as sensitive as”) statistical testing and descriptive statistics. In this respect, the comparison of performance showed that the ridgeline, bean, ggplot2’s violin, and MD plot have a similar sensitivity, in line with statistics for bimodality and skewness, as long as the sample is large enough (Figs 2–5), but for smaller sample sizes the MD plot outperforms them (see Figs 1 and 9 and Table A in S2 File). The sensitivity of the Python violin plot in these cases is comparable to the sensitivity of the bean plot in R. However, overlaying the MD plot with a robustly estimated Gaussian allows for an even higher sensitivity than statistical testing. Contrary to the bean plot and the Python violin plot, the MD plot does not indicate multimodality in uniform distributions.
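
The skewness part of this testing setup can be reproduced with SciPy, whose `skewtest` implements D'Agostino's test of skewness [14]; the dip test would require an additional package and is not shown. A minimal sketch with an assumed right-skewed alternative:

```python
import numpy as np
from scipy.stats import skewtest

rng = np.random.default_rng(42)
n = 15000  # sample size used in the experiments

symmetric = rng.normal(0.0, 1.0, n)   # skewness parameter 0
skewed = rng.lognormal(0.0, 0.5, n)   # clearly right-skewed alternative

# D'Agostino's test: H0 = the population skewness equals that of a
# normal distribution; small p-values indicate significant skewness.
stat_sym, p_sym = skewtest(symmetric)
stat_skw, p_skw = skewtest(skewed)
print(f"symmetric: p = {p_sym:.3f}; skewed: p = {p_skw:.3g}")
```

For a visualization to "be as sensitive as" this test, the skewed sample must be visibly asymmetric in the plot whenever `p_skw` is significant.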

Automatic ordering of the features makes skewness more clearly visible in the MD plot than in the ridgeline, bean, and Python violin plots. The natural example of the log of the German population’s income showed that for smaller samples (n = 500), the ridgeline and bean plots visualize unimodal instead of skewed distributions, in contrast to the histogram and MD plot. Additionally, the ridgeline and bean plots visualize a mode that is partly above the maximum value of 4.35. The same behavior regarding stretching over the valid value range and stronger smoothing of the representation could also be observed with the Python versions. The general recommendation is that “the larger the share of graphics ink devoted to data, the better”, other relevant matters being equal [55] (p. 96). Tukey and Wilk suggest avoiding undue complexity of form in summarizing and displaying [9] (p. 377). Tufte strongly argues to “erase non-data-ink within reason” [55] (p. 96). Hence, the tails of violin-like schematic plots should never extend past the range of data. For clipped data, the density estimates of the MD plot do not change, contrary to the bean plot.

Kampstra proposed adding a rug (1D scatter plot) to the violin plot in the bean plot [5]. On the one hand, plotting points in a marginal distribution can easily be misleading [56] (Fig 1), and the general recommendation is that “the number of information-carrying dimensions […] depicted should not exceed the number of dimensions in data” [55] (p. 71). On the other hand, if only a handful of unique values are present in the data, density estimation is inappropriate. Thus, the MD plot does not overlay the density estimation with a 1D scatter plot. Instead, it switches automatically to 1D jittered scatter plots if density estimation results in one or more Dirac delta distributions (e.g., the error rates taken from [57] in S6 File, section 5). The scatter plots are jittered, giving a minor indication of the amount of data sharing one unique value.
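
The automatic fallback described above can be sketched as a simple decision rule; the unique-value threshold below is a hypothetical stand-in for the MD plot's actual Dirac-delta criterion:

```python
import numpy as np

def choose_display(x, min_unique=12):
    """Return 'density' or 'jittered scatter' for a feature.

    Hypothetical rule: with only a handful of unique values, density
    estimation collapses into Dirac-delta-like spikes, so a 1D jittered
    scatter plot is drawn instead (cf. the MD plot's fallback).
    """
    x = np.asarray(x)
    return "density" if np.unique(x).size >= min_unique else "jittered scatter"

def jitter(x, width=0.05, seed=0):
    """Horizontal jitter so stacked identical values remain visible."""
    rng = np.random.default_rng(seed)
    return rng.uniform(-width, width, size=len(x))

error_rates = np.array([0.0, 0.0, 0.0, 0.05, 0.05, 0.1])  # discrete states
print(choose_display(error_rates))             # "jittered scatter"
print(choose_display(np.linspace(0, 1, 500)))  # "density"
```

The width of the jittered point cloud then hints at how many cases share a single value, without pretending a continuous density exists.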

Surprisingly, violin plots in R strongly depended on specific parameter settings in order to visualize bimodality. As suggested by the name, the violin plot is particularly intended to identify multimodality by exposing a waist between two modes of the distribution, which the box plot is unable to visualize. Additionally, the R violin plots underestimate the skewness of the distributions. Histograms were shown to be less sensitive in the case of bimodality because the default binwidth was not small enough. The effects found in the ridgeline, bean and Python violin plots for skewed distributions and clipped data were outlined further in the high-dimensional case of financial statements of companies listed on the German stock market [53], of which 12 features were selected as an example. Here, the visualizations of the ridgeline and bean plots produced an entirely misleading interpretation of the data, unlike the MD plot (cf. Table A in S2 File). The parameter settings of all plots, apart from supplementary information F, remained at default because a non-expert user would not have the capacity to change them, and an expert user would face difficulties setting density estimation parameters in a solely explorative approach for each feature separately. The effects of tuning parameters are presented exemplarily for the ggplot2 method geom_violin in S6 File (section 4). Certainly, many methods can be tuned to obtain a correct result for a specific distribution if prior knowledge is used. However, the example shows that tuning parameters for one distribution results in an incorrect visualization for another distribution. Although the Python ridgeline and violin plots use density estimators implemented in different packages, both plots show only marginally different results with the default setting.

The general performance of the MD plot seems to be sufficient for datasets of sizes up to 10^5. The Pareto density estimation and, subsequently, the Pareto radius have to be computed for each feature separately, which increases the computation time accordingly. Therefore, a parallel implementation of the density estimation is planned for the next iteration. Above 10^5 cases, Pareto density estimation becomes computationally intensive; for such big datasets (>10^5), the MD plot uses an appropriate subsampling method by default. PDE was not investigated below a sample size of 50 [11]; thus, below this threshold, no density estimation is performed in the default setting and a 1D scatter plot with jittered points is drawn instead. It should be noted that the Pareto density estimation (PDE) used in the MD plot is specially designed for the detection of multimodality, which could result in an overestimation of multimodality. Such an overestimation would be visible in the “roughness” of the mirrored density of a feature.
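
The default behavior for big datasets can be sketched as follows; the 10^5 threshold matches the text, while uniform random sampling without replacement is an assumption about the subsampling scheme actually used:

```python
import numpy as np

MAX_N = 10**5  # above this size, Pareto density estimation becomes expensive

def subsample_for_density(x, max_n=MAX_N, seed=0):
    """Return the feature unchanged if it is small enough, otherwise a
    uniform random subsample of max_n cases (assumed sampling scheme)."""
    x = np.asarray(x)
    if x.size <= max_n:
        return x
    rng = np.random.default_rng(seed)
    return rng.choice(x, size=max_n, replace=False)

big_feature = np.arange(3 * 10**5)
print(subsample_for_density(big_feature).size)  # 100000
```

Because density estimation only needs the shape of the distribution, a sufficiently large random subsample preserves skewness, modality and clipping while keeping the computation tractable.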

Literature suggests that schematic plots should be wider than they are tall because such shapes usually make it easier for the eye to follow from left to right [3] (p. 129). Small multiples of schematic plots usually present several features with the same graphical design structure at once. Tufte suggests that “If the nature of the data suggests the shape of the graphic, follow the suggestion” [55]. Therefore, in the opinion of the authors, the vertical display of box plots [3] should be favoured over the horizontal counterpart of range bars [7], and other schematic plots such as violin plots [4] should also be displayed vertically.

One of the key factors of graphical integrity is to show data variation, not design variation [55]. The schematic plots investigated here are supposed to visualize such variation by density estimation. Nonsymmetric displays are more useful in the specific task of comparing pairs of distributions to each other. Although bilateral symmetry doubles the space consumed by a graphic without adding new information, redundancy can give context and order to complexity, facilitating comparisons over various parts of the data [55] (p. 98). The goal of the MD plot is to make it easy to compare PDFs that are often complex. Accordingly, in a symmetrical display in which the body enclosed by the mirrored density line is filled out, clipping, skewness and multimodality are more visible than in nonsymmetrical displays.
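
The symmetric display discussed here amounts to mirroring one density curve around a vertical center line and filling the enclosed body. A minimal sketch of computing the mirrored outline, with SciPy's Gaussian KDE standing in for the Pareto density estimation actually used by the MD plot, and with the grid clipped to the data range so the tails never extend past the data:

```python
import numpy as np
from scipy.stats import gaussian_kde

def mirrored_outline(x, center=0.0, grid_size=200):
    """Coordinates of a mirrored (violin-shaped) density body.

    Returns (y, left, right): for each grid value y, the horizontal
    extent of the symmetric body around `center`. Gaussian KDE is a
    stand-in here; the MD plot itself uses Pareto density estimation.
    """
    x = np.asarray(x, dtype=float)
    kde = gaussian_kde(x)
    y = np.linspace(x.min(), x.max(), grid_size)  # clipped to data range
    d = kde(y)
    return y, center - d, center + d

rng = np.random.default_rng(7)
data = np.concatenate([rng.normal(-2, 0.5, 500), rng.normal(2, 0.5, 500)])
y, left, right = mirrored_outline(data)
print(np.allclose(-left, right))  # True: the body is bilaterally symmetric
```

The returned coordinates can then be rendered with, e.g., matplotlib's `fill_betweenx(y, left, right)`, which fills the body of the symmetric line as the text recommends.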

In sum, the results illustrate that the MD plot can outperform histograms and all other schematic plots investigated while remaining congruent with the descriptive statistics. Moreover, following the argumentation of Tukey and Wilk [9] (p. 375), it is more difficult to absorb broad information from tables of descriptive statistics than from a plot of all features in one picture. Typically, skewness and multimodality for each feature in Table A in S2 File would have been statistically tested, leading to an even bigger table. The MD plot offers several advantages beyond a simple density estimation of several features at once. 1D scatter plots below a sample-size threshold proved very helpful for the benchmarking of clustering algorithms because, in several cases, the performance evaluation yielded discrete states (see S6 File, section 5); to the knowledge of the authors, this has not yet been reported in the literature. The MD plot allows us to investigate distributions after common transformations such as robust normalization and to overlay distributions with robustly estimated Gaussians. The usage of transformations is often astonishingly effective [9] (p. 376). For example, using the robust transformation in combination with this type of overlaying increased the sensitivity for the tendency of a dataset to possess cluster structures compared with the usual statistical testing of the first principal component [58]. Wilk and Tukey argued to “plot the results of analysis as a routine matter” [9] (p. 380), for which the MD plot can be a useful tool. For example, ordering features by distribution shapes proved to be helpful when the performance of classifiers is evaluated by cross-validation [59]. Combined with the ggplot2 syntax, these advantages provide detailed error probability comparisons [60] with a high data-to-ink ratio (cf. [55], p. 96).

Conclusion

This work indicates that the density estimation approaches currently available in R and Python can lead to major misinterpretations if the default settings are not adjusted. On the one hand, adjusting the parameters of conventional plots would require prior knowledge or statistical assumptions about the data, which is generally challenging to acquire. On the other hand, “the effective laying open of the data to display the unanticipated” is a major portion of data analysis [9] (p. 371). For this case of strictly exploratory data mining, we propose a parameter-free schematic plot called the mirrored density plot. The MD plot represents the relative likelihood of a given feature (variable) taking on specific values, using the PDE approach to estimate the PDF. In PDE, the density is built from kernels of a specific width; the width, and therefore the number of kernels, depends on the data. The MD plot enables the user to estimate the PDFs of many features in one visualization. Both artificial data and natural examples forming multimodal and skewed distributions were used to show that the MD plot is a good indicator of bimodal as well as skewed distributions for small and large samples. All other approaches carry intrinsic assumptions about the data, which in some cases led to misguiding interpretations of the basic properties. Contrary to the commonly used density estimation approaches (as in the bean and violin plots), the MD plot possesses an explicit model of density estimation based on information theory and is parameter-free in the sense of a data-driven kernel radius. Furthermore, the MD plot has the advantage of visualizing the distribution of a feature correctly in the case of data clipping and varying ranges of features. In future research, a blind survey should be conducted to investigate how well a lay person can detect all underlying structures from the MD plots alone.
In sum, the MD plot enables non-experts to easily apply explorative data mining by estimating the basic properties of the PDFs (distributions) of many features in one visualization when setting several parameters is difficult.
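
The Pareto density estimation at the core of the MD plot can be sketched as counting, for each kernel position, the share of data lying within a fixed Pareto radius. In the sketch below the radius is approximated by a percentile of the pairwise distances; the 18% quantile and the rough normalization are assumptions for illustration, not the exact information-theoretic derivation of [11]:

```python
import numpy as np

def pareto_density(x, kernels=None, radius_quantile=18):
    """Simplified 1D Pareto density estimation sketch.

    The Pareto radius is approximated as a percentile of the pairwise
    distances (the 18% quantile is an assumed stand-in for the
    information-optimal radius of [11]); the density at each kernel
    position is the share of data points within that radius.
    """
    x = np.sort(np.asarray(x, dtype=float))
    if kernels is None:
        kernels = np.linspace(x.min(), x.max(), 100)
    # Pairwise 1D distances via broadcasting (fine for moderate n).
    dists = np.abs(x[:, None] - x[None, :])
    radius = np.percentile(dists[dists > 0], radius_quantile)
    counts = np.array([(np.abs(x - k) <= radius).sum() for k in kernels])
    return kernels, counts / (x.size * 2 * radius)  # rough normalization

rng = np.random.default_rng(3)
data = rng.normal(0, 1, 1000)
grid, dens = pareto_density(data)
```

Because the radius is derived from the data rather than supplied by the user, no bandwidth parameter has to be tuned, which is the sense in which the MD plot is parameter-free.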

Combining the MD plot with an (un)supervised index is an excellent approach to evaluating the stability of stochastic clustering algorithms (e.g., [15]) or classifiers. Furthermore, it can be used with quality measures for dimensionality reduction methods to compare projection methods (e.g., [15]). The MD plot is integrated into the R package ‘DataVisualizations’ on CRAN [46] within the framework of ggplot2 and into the Python package ‘md_plot’ on PyPI [47].

Supporting information

S1 File. ITS and MTY.

(DOCX)

S2 File. Descriptive statistics.

(DOCX)

S3 File. Conventional violin plot in Python.

(DOCX)

S4 File. Overlayed histograms.

(DOCX)

S5 File. Density and ridgeline plots in Python.

(DOCX)

S6 File. Violin plot of ggplot2.

(PDF)

S7 File

(DOCX)

Acknowledgments

We thank Felix Pape for the first implementation of the MD plot in the R package ‘DataVisualizations’ and Hamza Tayyab for programming the web scraping algorithm that was used to extract the quarterly statements. Special thanks go to Monika Sikora for language revision of this article and Martin Thrun for figure post-processing.

Data Availability

Data is attached to packages in R: https://CRAN.R-project.org/package=DataVisualizations and in python: https://pypi.org/project/md-plot/

Funding Statement

The authors received no specific funding for this work.

References

  • 1.Michael JR. The stabilized probability plot. Biometrika. 1983;70(1):11–7. [Google Scholar]
  • 2.Wilk MB, Gnanadesikan R. Probability plotting methods for the analysis of data. Biometrika. 1968;55(1):1–17. [PubMed] [Google Scholar]
  • 3.Tukey JW. Exploratory data analysis. Mosteller F, editor. United States Addison-Wesley Publishing Company; 1977. 688 p. [Google Scholar]
  • 4.Hintze JL, Nelson RD. Violin plots: a box plot-density trace synergism. The American Statistician. 1998;52(2):181–4. [Google Scholar]
  • 5.Kampstra P. Beanplot: A boxplot alternative for visual comparison of distributions. Journal of Statistical Software, Code Snippets. 2008;28(1):1–9. 10.18637/jss.v028.c01 [DOI] [Google Scholar]
  • 6.Wilke CO. Fundamentals of data visualization: a primer on making informative and compelling figures: O'Reilly Media; 2019.
  • 7.Spear ME. Charting statistics. New York: McGraw-Hill; 1952. [Google Scholar]
  • 8.McGill R, Tukey JW, Larsen WA. Variations of box plots. The American Statistician. 1978;32(1):12–6. [Google Scholar]
  • 9.Tukey JW, Wilk MB. Data analysis and statistics: techniques and approaches. The quantitative analysis of social problems. 1970:370–90. [Google Scholar]
  • 10.Wickham H, Stryjewski L. 40 years of boxplots. The American Statistician. 2011. [Google Scholar]
  • 11.Ultsch A. Pareto density estimation: A density estimation for knowledge discovery. In: Baier D, Werrnecke KD, editors. Innovations in classification, data science, and information systems. Proceedings of the 27th Annual Conference of the Gesellschaft für Klassifikation. 27. Berlin, Germany: Springer; 2005. p. 91–100.
  • 12.Hartigan JA, Hartigan PM. The dip test of unimodality. The annals of Statistics. 1985;13(1):70–84. [Google Scholar]
  • 13.Freeman JB, Dale R. Assessing bimodality to detect the presence of a dual cognitive process. Behavior research methods. 2013;45(1):83–97. 10.3758/s13428-012-0225-x [DOI] [PubMed] [Google Scholar]
  • 14.D'Agostino RB. Transformation to normality of the null distribution of g1. Biometrika. 1970;57(3):679–81. [Google Scholar]
  • 15.Thrun MC. Projection Based Clustering through Self-Organization and Swarm Intelligence. Heidelberg: Springer; 2018. [Google Scholar]
  • 16.Lötsch J, Ultsch A. Current Projection Methods-Induced Biases at Subgroup Detection for Machine-Learning Based Data-Analysis of Biomedical Data. International Journal of Molecular Sciences. 2020;21(1):79. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Thrun MC. Knowledge Discovery in Quarterly Financial Data of Stocks Based on the Prime Standard using a Hybrid of a Swarm with SOM In: Verleysen M, editor. European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN); 24–26 April Bruges, Belgium: Ciaco, 978-287-587-065-0; 2019. p. 397–402. [Google Scholar]
  • 18.Hyndman RJ, Fan Y. Sample quantiles in statistical packages. The American Statistician. 1996;50(4):361–5. [Google Scholar]
  • 19.Murtagh F, Legendre P. Ward’s hierarchical agglomerative clustering method: which algorithms implement Ward’s criterion? Journal of classification. 2014;31(3):274–95. [Google Scholar]
  • 20.Wilkin GA, Huang X, editors. K-means clustering algorithms: implementation and comparison. Second International Multi-Symposiums on Computer and Computational Sciences (IMSCCS 2007); 2007: IEEE.
  • 21.Linde Y, Buzo A, Gray R. An algorithm for vector quantizer design. IEEE Transactions on communications. 1980;28(1):84–95. [Google Scholar]
  • 22.Ultsch A, Thrun MC, Hansen-Goos O, Lötsch J. Identification of Molecular Fingerprints in Human Heat Pain Thresholds by Use of an Interactive Mixture Model R Toolbox (AdaptGauss). International journal of molecular sciences. 2015;16(10):25897–911. 10.3390/ijms161025897 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Zhang C, Mapes BE, Soden BJ. Bimodality in tropical water vapour. Quarterly Journal of the Royal Meteorological Society: A journal of the atmospheric sciences, applied meteorology and physical oceanography. 2003;129(594):2847–66. [Google Scholar]
  • 24.Bryson MC. Heavy-tailed distributions: properties and tests. Technometrics. 1974;16(1):61–8. [Google Scholar]
  • 25.Levy M, Solomon S. New evidence for the power-law distribution of wealth. Physica A: Statistical Mechanics and its Applications. 1997;242(1–2):90–4. [Google Scholar]
  • 26.Alstott J, Bullmore E, Plenz D. powerlaw: a Python package for analysis of heavy-tailed distributions. PloS one. 2014;9(1):e85777 10.1371/journal.pone.0085777 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Jordanova PK, Petkova MP, editors. Measuring heavy-tailedness of distributions. AIP Conference Proceedings; 2017: AIP Publishing.
  • 28.Royston P. Which measures of skewness and kurtosis are best? Statistics in Medicine. 1992;11(3):333–43. 10.1002/sim.4780110306 [DOI] [PubMed] [Google Scholar]
  • 29.Ferreira JTS, Steel MFJ. A constructive representation of univariate skewed distributions. Journal of the American Statistical Association. 2006;101(474):823–9. [Google Scholar]
  • 30.Scott DW. Multivariate density estimation: theory, practice, and visualization: John Wiley & Sons; 2015. [Google Scholar]
  • 31.Keating JP, Scott DW. A primer on density estimation for the great homerun race of 1998. STATS. 1999;25:16–22. [Google Scholar]
  • 32.Sievert C, Parmer C, Hocking T, Scott C, Ram K, Corvellec M, et al. plotly: Create Interactive Web Graphics via 'plotly.js'. 4.7.1 ed. CRAN2017. p. R package.
  • 33.Benjamini Y. Opening the box of a boxplot. The American Statistician. 1988;42(4):257–62. [Google Scholar]
  • 34.Bowman AW, Azzalini A. Applied smoothing techniques for data analysis: the kernel approach with S-Plus illustrations. New York, United States: Oxford University Press; 1997. [Google Scholar]
  • 35.Adler D. vioplot: Violin plot. 0.2 ed2005. p. R package.
  • 36.Bowman AW, Azzalini A. R package sm: nonparametric smoothing methods. 2.2–5.4 ed. University of Glasgow, UK and Università di Padova, Italia; 2014. p. http://www.stats.gla.ac.uk/~adrian/sm, http://azzalini.stat.unipd.it/Book_sm.
  • 37.R Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing; 2018.
  • 38.Venables WN, Ripley BD. Modern Applied Statistics with S. Fourth Edition ed. Chambers J, Eddy W, Härdle W, Sheather S, Tierney L, editors. New York: Springer; 2002. 501 p. [Google Scholar]
  • 39.Wickham H. ggplot2. Wiley Interdisciplinary Reviews: Computational Statistics. 2011;3(2):180–5. [Google Scholar]
  • 40.Sheather SJ, Jones MC. A reliable data-based bandwidth selection method for kernel density estimation. Journal of the Royal Statistical Society Series B (Methodological). 1991:683–90. [Google Scholar]
  • 41.Wilke CO. ggridges: Ridgeline Plots in 'ggplot2'. 0.5.1 ed. 2018. p. R package.
  • 42.Waskom M, Botvinnik O, Hobson P, Cole JB, others, Allan D. seaborn: v0.5.0 (November 2014). 2001. p. Python package.
  • 43.Jones E, Oliphant T, Peterson P, et al. SciPy: Open source scientific tools for Python. 2001. p. Python package.
  • 44.Racine JS. Nonparametric econometrics: A primer. Foundations and Trends® in Econometrics. 2008;3(1):1–88. [Google Scholar]
  • 45.Sheppard K, pktd J, Brett M, Gommers R, Seabold S. statsmodels 0.10.2. 2019. p. Python package.
  • 46.Thrun MC, Ultsch A. Effects of the payout system of income taxes to municipalities in Germany. In: Papież M, Śmiech S, editors. 12th Professor Aleksander Zelias International Conference on Modelling and Forecasting of Socio-Economic Phenomena; Cracow, Poland: Cracow: Foundation of the Cracow University of Economics; 2018. p. 533–42.
  • 47.Gehlert T. md_plot: A Python Package for Analyzing the Fine Structure of Distributions. 2019. p. Python package.
  • 48.Milligan GW, Cooper MC. A study of standardization of variables in cluster analysis. Journal of classification. 1988;5(2):181–204. [Google Scholar]
  • 49.Thrun MC, Ultsch A, editors. Models of Income Distributions for Knowledge Discovery. European Conference on Data Analysis; 2015; Colchester.
  • 50.Thrun MC, Hansen-Goos O, Griese R, Lippmann C, Lerch F, Lötsch J, et al. AdaptGauss. 1.3.3 ed. Marburg2015. p. R package.
  • 51.Fernández C, Steel MF. On Bayesian modeling of fat tails and skewness. Journal of the American Statistical Association. 1998;93(441):359–71. [Google Scholar]
  • 52.Ultsch A, Behnisch M. Effects of the payout system of income taxes to municipalities in Germany. Applied Geography. 2017;81:21–31. [Google Scholar]
  • 53.Prime-Standard. Teilbereich des Amtlichen Marktes und des Geregelten Marktes der Deutschen Börse für Unternehmen, die besonders hohe Transparenzstandards erfüllen.: Deutsche Börse; 2018 [18.09.2018]. Available from: http://deutsche-boerse.com/dbg-de/ueber-uns/services/know-how/boersenlexikon/boersenlexikon-article/Prime-Standard/2561178.
  • 54.Yahoo! Finance. Income statement, Balance Sheet and Cash Flow Germany: Microsoft Corp.; 2018 [cited 2018 29.09.2018]. Available from: https://finance.yahoo.com/quote/SAP/financials?p=SAP (Exemplary).
  • 55.Tufte ER. The visual display of quantitative information: Graphics press Cheshire, CT; 2001. 197 p.
  • 56.Brier SS, Fienberg SE. Recent econometric modeling of crime and punishment: support for the deterrence hypothesis? Evaluation Review. 1980;4(2):147–91. [Google Scholar]
  • 57.Thrun MC, Ultsch A. Using Projection based Clustering to Find Distance and Density based Clusters in High-Dimensional Data. Journal of Classification. 2020. 10.1007/s00357-020-09373-2 [DOI] [Google Scholar]
  • 58.Thrun MC. Improving the Sensitivity of Statistical Testing for Clusterability with Mirrored-Density Plot In: Archambault D, Nabney I, Peltonen J, editors. Machine Learning Methods in Visualisation for Big Data; Norrköping, Sweden: The Eurographics Association; 2020. [Google Scholar]
  • 59.Hoffmann J, Rother M, Kaiser U, Thrun MC, Wilhelm C, Gruen A, et al. Determination of CD43 and CD200 surface expression improves accuracy of B-cell lymphoma immunophenotyping. Cytometry Part B: Clinical Cytometry. 2020:1–7. 10.1002/cyto.b.21936 [DOI] [PubMed] [Google Scholar]
  • 60.Thrun MC, Ultsch A. Swarm Intelligence for Self-Organized Clustering. Artificial Intelligence. 2020;in press. 10.1016/j.artint.2020.103237 [DOI]

Decision Letter 0

Qichun Zhang

16 Dec 2019

PONE-D-19-19081

Analyzing the Fine Structure of Distributions

PLOS ONE

Dear Dr. rer. nat. Thrun,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we have decided that your manuscript does not meet our criteria for publication and must therefore be rejected.

I am sorry that we cannot be more positive on this occasion, but hope that you appreciate the reasons for this decision.

Yours sincerely,

Qichun Zhang, PhD

Academic Editor

PLOS ONE

Additional Editor Comments (if provided):

Two reviewers returned the critical comments focusing on the novelty of the manuscript. Basically, the author redo some existing method using Python where the new features have not been demonstrated clearly. In addition, the English writing should be pre-checked where some typos would affect the readability of the manuscript.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Partly

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: I Don't Know

Reviewer #2: No

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The paper draws important attention to the pitfalls of existing distributional visualizations for effectively summarizing the nuances of non-normal distributions. Particularly, the paper assesses the efficacy of existing graphical representations (violin plot, box plot, bean plot) for summarizing skewed, multimodal, and uniform distributions, and provides a context and implementation to introduce 'mirror density plots' as an alternative.

The context given for existing visual tools is sound, though there are some areas that could be improved:

- Regarding histograms in section 2.2: “…in this work, only default parameter will be used because layman would probably not adjust parameters”. Given that the target user of a statistical visualization package in R or Python likely has experience in data science or statistics, this assumption warrants re-examination. For example: adding (bins=“auto”) is a common procedure for researchers using matplotlib’s built-in histogram function (see: https://docs.scipy.org/doc/numpy/reference/generated/numpy.histogram_bin_edges.html#numpy.histogram_bin_edges). Additional context for histograms could be improved with an acknowledgement of non-uniform binning methods, though this may not be in scope for the paper.

- It is surprising that this paper contains no ordinary density plots accompanied by sub-axis rugs, which are common methods for analyzing distributions. Similarly, ridgeline plots do not make an appearance, either in discussions of existing visualization methods, or in schematic comparisons of multiple dimensions. Given that these graphical representations seem more common than bean plots, for example, which are discussed at length, background context would benefit from the inclusion of more ordinary non-symmetric representations of distributions.

- The paper would benefit from a ridgeline plot with a single axis, comparing the same non-normal distribution with various binning methods (e.g. each: default histogram using n=10; SciPy's "auto" method mentioned above; Scott's Rule mentioned in paper and detailed on SciPy link above; proposed PDE method; any others potentially relevant according to authors' literature review)

The fundamental scientific contribution of the paper is the usage of the Pareto Density Estimation (PDE) to construct a visualization of a univariate distribution which captures non-normal characteristics of distributions, such as skew, multimodality, and uniformity. This method appears well-supported, and is explained concisely, with easily accessible packages for both R and Python to supplement the work.

While the PDE method for binning appears well-defended, the implementation into a visual language leaves some questions unanswered:

- In the broader data visualization community, "Mirror Density" plots are bivariate distributions: for example, one might construct two violin plots of distributions, conditioned on a second binary value (e.g. control vs. experiment), split the resultant forms in half lengthwise, and position them opposite one another to create a comparative representation of the conditioned distributions (see: https://www.d3-graph-gallery.com/graph/density_mirror.html). In this bivariate application, the comparative symmetry adds value to the analytic process. It is unclear from the paper whether the authors are aware of this namespace convergence, but independent of nomenclature, the paper would benefit from an assessment of the analytic value for making a univariate density plot symmetric.

- In section 3.5 "The high-dimensional data set (d=45)... is investigated by selecting 12 features": Ridgeline plots with the PDE binning method may be a more space-conservative method of implementing the algorithm (though admittedly, d=45 remains a non-trivial 'curse of dimensionality').

- With regard to the German stock market data in section 3.5, the schematic MD (Fig. 9) and violin (Fig. 10) plots compare distributions in very different ranges. The paper would benefit from the removal of 'InterestExpense' and 'CapitalExpenditures' from the exemplary features, perhaps to be replaced with features of a range more similar to the other features in the schematic plots.

- In general, plots should be ordered where possible, e.g. Fig. 5b,c,d should show skew parameter xi in order [0.6,0.95,1,1.1] for clarity.

- Stacked histograms are not advisable for this application. Stacked histograms make sense when considering how categories sum to a total population (e.g. when exploring various revenue sources and the resultant aggregate revenue in a single graphic). For comparing model distributions of various skewness parameters, e.g. in Fig. 3a, 5a (are these the same histogram?), stacking does not seem appropriate. Neither does stacking seem appropriate for histograms of normalized data, e.g. Robustly Normalized values for Income Tax Share (ITS) and Municipality Income Tax Yield (MTY) in Fig. 12a. The sum total of 2 bins normalized from different ranges provides no substantial comparative analytic value. In both stacked histogram cases: overlaying, rather than stacking, may provide the intended visual effect, and would be appropriate to the data.

As a reviewer with experience in data visualization, I feel confident in my assessment of this component of the paper; however, it is my hope that fellow peer reviewers can speak in a more informed manner on the statistical evaluations and experiments performed.

The paper's conclusion in Section 5 "current density estimation approaches can lead to major misinterpretations if the default setting is not adjusted" seems to suggest that the scientific community would benefit greatly from the addition of PDE binning methods to existing open-source visualization packages such as ggplot, matplotlib, seaborn and plotly. I hope the authors consider integrated contribution to existing open-source tools.

The paper would be improved by a spelling check, and a grammar proofread by a native English speaker. Typos and grammatical oversights do not obstruct communication, but do inhibit narrative flow.

Reviewer #2: This paper studies the mirrored desity (MD) plot and show various structures of the MD results. This paper proposed a MD plot implemented in Python. Since mirrorred density (MD) plot has been developed in R already, the contribution of this paper is not clearly justified. It is not clear what kind of new features this paper introduce into the MD plot.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Jane Lydia Adams

Reviewer #2: No

PLoS One. 2020 Oct 14;15(10):e0238835. doi: 10.1371/journal.pone.0238835.r002

Author response to Decision Letter 0


9 Jan 2020

Dear Editor-in-Chief, Dr. Joerg Heber, Dear Editorial Board,

We are very grateful for the important suggestions written by the first reviewer. However, we are shocked by the quality of the second reviewer's report and by the oversight of the handling editor. This is the first time we have ever heard of or read a review in a high-class journal with evidently false statements and no scientific contribution at all. That is not proper handling or reviewing of a manuscript. Due to the anonymity of the second reviewer, we assume competing interests. We hereby report scientific misconduct.

We have addressed the false statements in detail below (our responses follow the reviewers' comments in red letters). We then address the positive and helpful scientific review of the first reviewer. It will become evident that the first reviewer has written a positive review and that the second reviewer either did not read the manuscript or has competing interests.

On 20.12.19, the manuscript was reinstated and a revision was allowed. We therefore take the opportunity to revise the manuscript in accordance with the suggestions of the first reviewer.

Yours sincerely,

Michael Thrun, Tino Gehlert, and Alfred Ultsch

General Comments:

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Partly

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: I Don't Know

Reviewer #2: No

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

Reviews

Reviewer #2: “This paper studies the mirrored desity (MD) plot and show various structures of the MD results.”

This statement is incorrect. We propose a method for visualizing the univariate probability density functions (pdfs) of data consisting of several features, presenting the pdfs in a single visualization. The results show the comparison to state-of-the-art methods. This was stated, e.g., in the abstract of the manuscript:

“Data visualization tools should deliver a sensitive picture of the univariate probability density distribution (PDF) for each feature. Visualization tools for PDFs are typically kernel density estimates and range from the classical histogram to modern tools like bean or violin plots. These visualization tools are evaluated in comparison to statistical tests for the typical challenges of explorative distribution analysis. Conventional methods have difficulties in visualizing the pdf in case of uniform, multimodal, skewed and clipped data if density estimation parameters remain in a default setting. As a consequence, a new visualization tool called Mirrored Density plot (MD plot) is proposed which is particularly designed to discover interesting structures in continuous features”

We also motivated the usage by:

“We compare the visualizations to basic descriptive statistics and show which visualization tools do not visualize the shapes of the pdf accurately. Table 1 summarizes the interesting basic properties from the perspective of data mining and the methods used to compare performance.”

Table 1: Summary of basic properties of empirical distributions that are interesting for data mining.

| Interesting basic properties | Exemplary data mining applications | Statistical test used | Descriptive statistic |
|---|---|---|---|
| Uniformity versus multimodality | Biomedical data (13); water vapour (14) | Hartigans' dip test (6) | Difference between mean and median can indicate multimodality; several coefficients (14) |
| Data clipping versus heavy-tailedness | Flood data (15); upper income (16) | Not required here, but we can refer to (15, 17) | Range of data is sufficient for the task. "There is no easy characteristic for heavy-tailedness" (18) |
| Skewness versus normality | Biomedical data (19); strength of glass fibres & market value growth (20) | D'Agostino test (8) | Third-order statistics, e.g. (19) |

In Table 1, we added water vapour as a further application and exemplary coefficients for bimodality.
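As a brief illustration of the descriptive statistics listed in Table 1 (synthetic data; a sketch only, not the manuscript's implementation), the mean-median gap and third-order sample skewness can be computed with standard-library Python:

```python
import random
import statistics

random.seed(1)

def sample_skewness(xs):
    """Third-order statistic: adjusted Fisher-Pearson sample skewness."""
    n = len(xs)
    m = statistics.fmean(xs)
    s = statistics.stdev(xs)
    g1 = sum((x - m) ** 3 for x in xs) / (n * s ** 3)
    return g1 * ((n * (n - 1)) ** 0.5) / (n - 2)

# Hypothetical bimodal sample: unequal mixture of two Gaussians.
bimodal = [random.gauss(0, 1) for _ in range(700)] + \
          [random.gauss(6, 1) for _ in range(300)]

# A large mean-median gap can hint at multimodality or skewness;
# it is only a hint, not a statistical test.
gap = statistics.fmean(bimodal) - statistics.median(bimodal)
assert gap > 0.5

# Hypothetical right-skewed sample: lognormal values.
skewed = [random.lognormvariate(0, 0.8) for _ in range(1000)]
assert sample_skewness(skewed) > 1.0  # clearly right-skewed

# A symmetric sample has (exactly) zero third-order skewness.
assert abs(sample_skewness([float(i) for i in range(101)])) < 1e-9
```

Such coefficients complement, but do not replace, the dip and D'Agostino tests named in the table.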

“This paper proposed a MD plot implemented in Python.”

This statement is false. We propose a method which is implemented in both R and Python, and we compare this method with state-of-the-art methods in R and Python. We provide a vignette in R and an introductory tutorial in Python:

As stated exemplarily in the methods section of the manuscript (2.1):

“To make sure that the MD plot introduced here does not depend on the specific implementation, we provide the package in two different programming languages (R and Python), reproducing the R results presented in this manuscript in the Python tutorial attached to this work.”

Exemplarily, the reviewed manuscript contained the following sentences in methods section 2.3:

“The MD plot can be applied by using the R package ‘DataVisualizations’ on CRAN (21). In the next section, the visual performance of indicating the correct distribution of features is investigated for the histogram, violin, and bean plot in comparison to the MD plot. The Python implementation of the MD plot is provided in the Python package ‘md_plot’ on PyPi (22). The vignettes describing the usage and providing the data are attached to this work for the two most common data science programming languages, Python and R.”

And in the conclusion section:

“The MD plot is available in the R-package ‘DataVisualizations’ on CRAN (21) and in the Python package ‘md_plot’ on PyPi (22).”

21. Thrun MC, Ultsch A. Effects of the payout system of income taxes to municipalities in Germany. In: Papież M, Śmiech S, editors. 12th Professor Aleksander Zelias International Conference on Modelling and Forecasting of Socio-Economic Phenomena; Cracow, Poland. Cracow: Foundation of the Cracow University of Economics; 2018. p. 533-42.

22. Gehlert T. md_plot: A Python Package for Analyzing the Fine Structure of Distributions. 2019. Python package.

Using the insights of the reviewer below, we specified the sentence as follows:

“The MD plot is integrated into the R-package ‘DataVisualizations’ on CRAN (37) in the framework of ggplot2, and in the Python package ‘md_plot’ on PyPi (38).”

“Since mirrorred density (MD) plot has been developed in R already, the contribution of this paper is not clearly justified.”

This statement is false. One important goal is to show that our algorithm works independently of the programming language and the specific implementation. We would like to refer to section 2.1:

“Comparing visualizations is challenging because they have the same issues as the estimation of quantiles or clustering algorithms like k-means or Ward: they depend on the specific implementation (c.f. (9), (10), (11, 12)). Therefore, this work restricts the comparison to several conventional methods and specifies the programming language, package and pdf estimation approach used in order to outline several relevant problems for visualizing the basic properties of the pdf. To make sure that the MD plot introduced here does not depend on the specific implementation, we provide the package in two different programming languages (R and Python), reproducing the R results presented in this manuscript in the Python tutorial attached to this work.”

The MD plot method was developed for this manuscript by the first author. It is included for technical reasons in the package ‘DataVisualizations’, which was initially developed by the first author and others for various conventional visualization approaches used in data science. As every R package on CRAN requires a citation, this package itself is cited with a conference publication about the visual investigation of correlation coefficients. A short look into the description file on CRAN yields

“Gives access to data visualisation methods that are relevant from the data scientist's point of view. The flagship idea of 'DataVisualizations' is the mirrored density plot (MD-plot) for either classified or non-classified multivariate data presented in Thrun et al. (2019) <arXiv:1908.06081>. ”

which clearly references this publication with a preprint published on arXiv, as suggested by PLOS ONE. For any user of packages in R or Python (i.e., all data scientists), this is obvious.

“It is not clear what kind of new features this paper introduce into the MD plot.”

As the reviewer evidently did not even read the abstract of the manuscript, it is indisputable that the features are unclear to the reviewer.

Reviewer #1: The paper draws important attention to the pitfalls of existing distributional visualizations for effectively summarizing the nuances of non-normal distributions. Particularly, the paper assesses the efficacy of existing graphical representations (violin plot, box plot, bean plot) for summarizing skewed, multimodal, and uniform distributions, and provides a context and implementation to introduce 'mirror density plots' as an alternative.

The context given for existing visual tools is sound, though there are some areas that could be improved:

We thank the first reviewer for clearly stating that our methodology is sound and that the problems shown in the paper are important.

- Regarding histograms in section 2.2: “…in this work, only default parameter will be used because layman would probably not adjust parameters”. Given that the target user of a statistical visualization package in R or Python likely has experience in data science or statistics, this assumption warrants re-examination. For example: adding (bins=“auto”) is a common procedure for researchers using matplotlib’s built-in histogram function (see: https://docs.scipy.org/doc/numpy/reference/generated/numpy.histogram_bin_edges.html#numpy.histogram_bin_edges). Additional context for histograms could be improved with an acknowledgement of non-uniform binning methods, though this may not be in scope for the paper.

We agree with the reviewer that binning parameters are not within the scope of this paper, as we already provide 30 figures in this manuscript.

Additionally, besides the “auto” parameter in Python, the setting of parameters for the binning of histograms in any exploratory analysis is infeasible, as we stated in the manuscript:

“In the last step, we exploratively investigate a new data set with several features with unknown basic properties in order to summarize the challenges of visualizing the estimated probability density function. In such a typical data mining setting, it would be a very challenging task to adjust parameters of the conventional visualization tools investigated here. We compare the visualizations to basic descriptive statistics and show which visualization tools do not visualize the shapes of the pdf accurately.”
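For context, the common automatic binning rules behind settings such as NumPy's bins="auto" can be sketched in plain Python (formulas as commonly published; this is not the manuscript's PDE method, and the sample is hypothetical):

```python
import math
import random
import statistics

random.seed(2)
data = sorted(random.gauss(0, 1) for _ in range(1000))  # hypothetical sample
n = len(data)
rng = data[-1] - data[0]

# Sturges' rule: number of bins grows with log2(n).
k_sturges = math.ceil(math.log2(n)) + 1

# Scott's rule: bin width derived from the standard deviation.
h_scott = 3.49 * statistics.stdev(data) * n ** (-1 / 3)

# Freedman-Diaconis rule: bin width from the interquartile range,
# more robust against outliers than Scott's rule.
q = statistics.quantiles(data, n=4)
h_fd = 2 * (q[2] - q[0]) * n ** (-1 / 3)

# NumPy's "auto" takes the larger bin count of Sturges and
# Freedman-Diaconis (i.e., the smaller bin width).
k_auto = max(k_sturges, math.ceil(rng / h_fd))
```

All of these rules share the limitation discussed in the response: each assumes a roughly unimodal, well-behaved sample, so none of them removes the need to inspect the estimated pdf itself.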

Following the reviewer's suggestion to use ridgeline plots, we moved the histograms into supplementary D and instead use ridgeline plots in the manuscript itself.

- It is surprising that this paper contains no ordinary density plots accompanied by sub-axis rugs, which are common methods for analyzing distributions. Similarly, ridgeline plots do not make an appearance, either in discussions of existing visualization methods, or in schematic comparisons of multiple dimensions. Given that these graphical representations seem more common than bean plots, for example, which are discussed at length, background context would benefit from the inclusion of more ordinary non-symmetric representations of distributions.

We thank the reviewer for the suggestion. We now include ridgeline plots consistently throughout the paper instead of histograms, using the ggridges R package on CRAN and, in supplementary E, the ‘kdeplot’ function of the ‘seaborn’ package.

We changed the abstract from

“Visualization tools for PDFs are typically kernel density estimates and range from the classical histogram to modern tools like bean or violin plots.”

To

“Visualization tools for PDFs are typically kernel density estimates and range from the classical histogram to modern tools like ridgeline plots, bean or violin plots.”

We added to the methods section in the case of R

“Another approach consists of ridgeline plots. “Ridgeline plots are partially overlapping line plots that create the impression of a mountain range” (31). They are available in R in the ggridges package on CRAN (31) and either use the density estimation approaches of R discussed above (if set manually) or per default “estimates the data range and bandwidth for the density estimation from the entire data at once, rather than from each individual group of data” (31). The default setting is used in this work.”

And in the case of Python

“The density plots and ridgeline plots in Python presented in supplementary E are created by using the ‘kdeplot’ function of the ‘seaborn’ package. This approach uses the density estimation by Racine (34) implemented in the ‘statsmodel’ package (35) if it is installed. If it is not installed, the density estimation of ‘scipy’ is used.”

We added to the results section the parts regarding the ridgeline plot in R and Python, and we changed the references with regard to the histogram in plotly. For a better overview, we marked these minor changes in this letter using Microsoft Word's review mode:

“Initially, a random sample of 1000 points of a uniform distribution was drawn and visualized by a ridgeline plot, violin plot, bean plot, and MD plot (Fig. 1) and histogram (SI D, Fig. 19) as well as for density estimation in Python (SI C, Fig. 13, SI E, Fig. 24).”

“Fig. 1: Uniform distribution in the interval [-2,2] of a 1000-point sample visualized by a ridgeline plot (a) of ggridges on CRAN (32), violin plot (b), bean plot (c) and MD plot (d). In the ridgeline plot, violin plot and bean plot, the borders of the uniform distribution are skewed contrary to the real amount of values around the borders -2 and 2. The bean plot and ridgeline plot indicate multimodality but Hartigans’ dip statistic (6) disagrees: p(n=1000, D=0.01215)=0.44.”

“Contrary to the expectation, the ridgeline plot, histogram and bean plot indicate multimodality, and the bean plot, ridgeline plot, and violin plot bend the pdf line in the direction of the end points.”

“This result is visualized in Fig. 3. The bimodality is visible in the ridgeline plot and bean plot starting with a mean equal to 2.4, and in the MD plot starting with a mean equal to 2.4. However, a robustly estimated Gaussian in magenta is overlaid in the MD plot, making bimodality visible starting from a mean of 2.2. Hartigans' dip statistic (6) agrees with these two schematic plots. In contrast, violin plots do not show a bimodal distribution (Fig. 3), while the Python violin and ridgeline plots show the bimodality starting with a mean equal to 2.4 (SI C, Fig. 14, SI E, Fig. 25).”

“Fig. 3: Plots of a bimodal distribution with changing mean of the second Gaussian: ridgeline plots (a) of ggridges on CRAN (32), violin plot (b), bean plot (c), and MD plot (d). Bimodality is visible beginning from mean 2.4 in the ridgeline plots and MD plot, but the MD plot draws a robustly estimated Gaussian (magenta) if statistical testing is not significant, which indicates for a mean of two that the distribution is not unimodal. The bimodality of the distribution is not visible in the violin plot.”

“Unlike the R version, the skewness is visible in the Python version of the violin plot (SI C, Fig. 15, SI E, Fig. 26), but slightly less sensitively than in the bean plot and MD plot. In the histogram, the skewness of the distribution is difficult to recognize (SI D, Fig. 21).”

“Fig. 5: Plots of a skewed normal distribution with changing skewness using the R package fGarch (43) on CRAN: ridgeline plots (a) of ggridges on CRAN (32), violin plot (b), bean plot (c) and MD plot (d). The sample is large (n=15000). The violin plot is less sensitive to the skewness of the distribution. The MD plot allows for an easier detection of skewness by ordering the columns automatically.”

“This issue can also be observed with the Python violin and density plots (SI C, Fig. 16, SI E, Fig. 27).”

“In Fig. 8, it is visible that the violin plot underestimates the skewness of the distribution contrary to the MD plot. The ridgeline, violin and bean plots show a mode in the skewed distribution between 4 and 4.5 contrary to the MD plot (Fig. 8). In SI D, Fig. 22, the histogram agrees with the MD plot and disagrees with the bean plot that there are no values above 4.35, meaning that the ridgeline and bean plots visualize a pdf above the maximum value (marked with red lines).”

“The Python density and violin plots show, like the bean plot, values above 4.35, but smooth the distribution more (SI C, Fig. 17, SI E, Fig. 28) and, hence, do not indicate multimodality.”

“Fig. 7: Distribution analysis performed on the log of German people’s income in 2003 with ridgeline plots (a) of ggridges on CRAN (37), which do not indicate clipping or multimodality.”

“In such a high-dimensional case, statistical testing, parameter settings, usual density plots and histograms become very troublesome and thus are omitted in this work. Moreover, it becomes challenging to integrate different ranges in one visualization. In Tab. 1, SI B, the ordering of the descriptive statistics from top to bottom is the same as in the MD plot, ridgeline plot and bean plot from left to right. The MD plot enables ordering by concavity, which is used here. The MD plot (Fig. 9), the bean plot (Fig. 10a) and the ridgeline plot (Fig. 10b) visualize all variables in one picture.”

“The bean plot changes skewed distributions to distributions with one mode or uniform distributions (Fig. 10a). In the bean plot and ridgeline plot (Fig. 10b) there are no hard cuts around the value zero (red line).”

“In sum, the visualization of the MD plot is in agreement with the descriptive statistics (SI B, S1 Table) and in disagreement with the bean plot and ridgeline plot. The Python violin and ridgeline plots show values above and below the limits of [-250000, 1000000] and less detailed, incorrectly unimodal distributions (SI C, Fig. 18, SI E, Fig. 29).”

“Fig. 10: Bean plots (a, top) of selected features from 269 companies on the German stock market reporting quarterly financial statements by the Prime standard and ridgeline plots (b, bottom) of ggridges on CRAN (37). The ordering of the features is by concavity and the same as in Fig. 9. There is no hard cut around the value zero (red line) and the variables are unimodal or uniform with a large variance and a small skewness. The visualizations disagree with the descriptive statistics in SI B, S1 Table. Note that, for a better comparison, we disabled the additional overlay plots in the bean plots.”

“However, the violin plot is unable to visualize the bimodal distribution, and the overlaid histogram underestimates it significantly (SI D, Fig. 23). The Python density and violin plots draw data above and below the limits of the data, but show the bimodality of the ITS feature (SI C, Fig. 19, SI E, Fig. 30).”

In the discussion:

“Three artificial and four natural datasets show the limitations of the schematic plots of the ridgeline plot, bean plot and violin plot (R and Python versions). A comparison of results to conventional statistical testing and histograms is included. The results illustrate that the usefulness of the ridgeline plot, violin or so-called bean plot depends on the density estimation approach used in the algorithm, …”

“…by changing the skewness parameter (sample size n=15000). Statistical testing, the ridgeline plot, bean plot and MD plot have a similar sensitivity regarding bimodality and skewness as long as the sample is large enough.”

“Automatically ordering the features makes skewness more clearly visible in the MD plot in comparison to the ridgeline plot, bean plot and Python violin plot. The natural example of the log of German people’s income showed that for smaller samples (n=500) the ridgeline plot and bean plot visualize unimodal distributions instead of skewed distributions, disagreeing with the histogram and MD plot. Additionally, the ridgeline plot and bean plot…”

“…with the Python versions. For clipped data, the density estimates of the MD plot do not change, contrary to the bean plot.”

“The parameter settings of all plots until the last experiment remained at default because a non-expert user would not change them and an expert user would have difficulties to set density estimation parameters in a solely explorative approach for each feature separately. Although the Python ridgeline and the violin plot use density estimators implemented in different packages, both only show marginally different results with the default setting.”

- The paper would benefit from a ridgeline plot with a single axis, comparing the same non-normal distribution with various binning methods (e.g. each: default histogram using n=10; SciPy's "auto" method mentioned above; Scott's Rule mentioned in paper and detailed on SciPy link above; proposed PDE method; any others potentially relevant according to authors' literature review)

In supplementary E we now state:

“This section covers density plots and ridgeline plots created by using the ‘kdeplot’ function of the ‘seaborn’ package. The default value (Scott's rule of thumb) of the bandwidth parameter was used.”
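For intuition, Scott's rule of thumb referenced in the quote can be sketched as a minimal Gaussian kernel density estimate (standard-library Python with hypothetical data; a sketch, not seaborn's implementation):

```python
import math
import random
import statistics

random.seed(3)
sample = [random.gauss(0, 1) for _ in range(500)]  # hypothetical data

# Scott's rule in one dimension: bandwidth h = sigma * n**(-1/5).
n = len(sample)
h = statistics.stdev(sample) * n ** (-1 / 5)

def kde(x, data, bandwidth):
    """Gaussian kernel density estimate at point x."""
    c = 1.0 / (len(data) * bandwidth * math.sqrt(2 * math.pi))
    return c * sum(math.exp(-0.5 * ((x - d) / bandwidth) ** 2) for d in data)

# The estimate is a proper density: non-negative and integrating to ~1.
grid = [-6 + 0.02 * i for i in range(601)]
area = sum(kde(x, sample, h) for x in grid) * 0.02
assert abs(area - 1.0) < 0.02
```

The point made in the manuscript still applies: a single rule-of-thumb bandwidth is chosen per feature and is not adapted to multimodality, skewness, or clipping.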

We want to remark that in exploratory data analysis the setting of any parameters would become infeasible because the real pdf is unknown, and the parameters would have to be set for each pdf estimation separately, which cannot be done manually for many features in a limited time.

This is specified in the manuscript in the discussion:

From

“The parameter settings of all plots until the last experiment remained at default because a non-expert user would not change them and an expert user would have difficulties to set density estimation parameters in a solely explorative approach”

To

“The parameter settings of all plots until the last experiment remained at default because a non-expert user would not change them and an expert user would have difficulties to set density estimation parameters in a solely explorative approach for each feature separately.”

And in the conclusion it was stated in the manuscript already:

“Adjusting the parameters of conventional plots would require prior knowledge or statistical assumptions about the data, which are often difficult to acquire.”

Additionally, investigating various parameters would overload the manuscript, as we already provide 30 figures.

The fundamental scientific contribution of the paper is the usage of the Pareto Density Estimation (PDE) to construct a visualization of a univariate distribution which captures non-normal characteristics of distributions, such as skew, multimodality, and uniformity. This method appears well-supported, and is explained concisely, with easily accessible packages for both R and Python to supplement the work.

We thank the reviewer for grasping the concept of our paper and understanding the usual referencing of the packages.

While the PDE method for binning appears well-defended, the implementation into a visual language leaves some questions unanswered:

- In the broader data visualization community, "Mirror Density" plots are bivariate distributions: for example, one might construct two violin plots of distributions, conditioned on a second binary value (e.g. control vs. experiment), split the resultant forms in half lengthwise, and position them opposite one another to create a comparative representation of the conditioned distributions (see: https://www.d3-graph-gallery.com/graph/density_mirror.html). In this bivariate application, the comparative symmetry adds value to the analytic process. It is unclear from the paper whether the authors are aware of this namespace convergence, but independent of nomenclature, the paper would benefit from an assessment of the analytic value for making a univariate density plot symmetric.

We were not aware of this ambiguous naming of plots. We discuss this ambiguity in the corrected version of the manuscript in the methods section:

“It should be noted that there exists an ambiguity in the naming because of the existence of “Mirror Density” plots, a graphical representation of bivariate distributions (e.g., https://www.d3-graph-gallery.com/graph/density_mirror.html). However, maximum likelihood plots can be more informative for such a use case (e.g. (39)). In the opinion of the authors, the name “Mirrored Density” plot (MD plot) is more specific than “Mirror Density” plot because the density estimation is univariate, with a graphical representation as a “line” and a symmetrical reflection of the same information filled out with a color, instead of using a bivariate density estimation, which does not mirror the “line” of the plot.”
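The geometric point of the quoted passage, that the MD plot reflects one univariate density line rather than estimating a bivariate density, can be sketched as follows (toy density values; a hypothetical helper, not the package's code):

```python
def mirror_polygon(grid, density):
    """Build the outline of a mirrored-density ('violin'-shaped) glyph:
    the same univariate density line is drawn at +d and reflected at -d,
    so both halves carry identical information."""
    right = list(zip(density, grid))              # (+d, x), traversed upward
    left = [(-d, x) for d, x in reversed(right)]  # (-d, x), traversed downward
    return right + left

# Toy density line on a small grid (illustrative values only).
grid = [0.0, 0.5, 1.0, 1.5, 2.0]
density = [0.1, 0.4, 0.6, 0.4, 0.1]
outline = mirror_polygon(grid, density)

# Perfect symmetry: for every outline point (d, x), (-d, x) also occurs.
assert all((-d, x) in outline for d, x in outline)
```

Because the mirrored half is a pure reflection, no second density estimate is involved, which is the distinction from the bivariate "Mirror Density" plots cited above.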

- In section 3.5 "The high-dimensional data set (d=45)... is investigated by selecting 12 features": Ridgeline plots with the PDE binning method may be a more space-conservative method of implementing the algorithm (though admittedly, d=45 remains a non-trivial 'curse of dimensionality').

We tried this approach, but with too many features it becomes infeasible to inspect such a plot. We consider a detailed treatment of this point to be outside the scope of the manuscript.

- With regard to the German stock market data in section 3.5, the schematic MD (Fig. 9) and violin (Fig. 10) plots compare distributions in very different ranges. The paper would benefit from the removal of 'InterestExpense' and 'CapitalExpenditures' from the exemplary features, perhaps to be replaced with features of a range more similar to the other features in the schematic plots.

We thank the reviewer for pointing out this issue and have specified our work more precisely. We added the following to the results section:

“In such a high-dimensional case, statistical testing, parameter settings, usual density plots, and histograms become very troublesome and are thus omitted in this work. Moreover, it becomes challenging to integrate different ranges in one visualization.”

- In general, plots should be ordered where possible, e.g. Fig. 5b,c,d should show skew parameter xi in order [0.6,0.95,1,1.1] for clarity.

The authors thank the reviewer for the suggestion. The mistake was corrected.

- Stacked histograms are not advisable for this application. Stacked histograms make sense when considering how categories sum to a total population (e.g. when exploring various revenue sources and the resultant aggregate revenue in a single graphic). For comparing model distributions of various skewness parameters, e.g. in Fig. 3a, 5a (are these the same histogram?), stacking does not seem appropriate. Neither does stacking seem appropriate for histograms of normalized data, e.g. Robustly Normalized values for Income Tax Share (ITS) and Municipality Income Tax Yield (MTY) in Fig. 12a. The sum total of 2 bins normalized from different ranges provides no substantial comparative analytic value. In both stacked histogram cases: overlaying, rather than stacking, may provide the intended visual effect, and would be appropriate to the data.

The authors thank the reviewer for pointing out this major issue. The manuscript has been rewritten to focus more on ridgeline plots instead of histograms. Histograms were moved to supplementary D because they are challenging to use if many features are given, and the issue of (in)correct binning can always be raised.

The authors thank the reviewer for pointing out the correct definition of stacked histograms. The wording was changed throughout the manuscript from “stacked” to “overlaid” because, under the reviewer's definition above, stacked histograms were not computed. To be more precise, supplementary D now contains the sentence:

“Each histogram is computed separately and thereafter integrated into one plot using plotly.”

The authors made a mistake by twice providing an overlaid histogram for the skewed experiment, instead of providing one overlaid histogram for the skewed distribution and one for the bimodal experiment. This is now also corrected.

As a reviewer with experience in data visualization, I feel confident in my assessment of this component of the paper; however, it is my hope that fellow peer reviewers can speak in a more informed manner on the statistical evaluations and experiments performed. The paper's conclusion in Section 5 "current density estimation approaches can lead to major misinterpretations if the default setting is not adjusted" seems to suggest that the scientific community would benefit greatly from the addition of PDE binning methods to existing open-source visualization packages such as ggplot, matplotlib, seaborn and plotly. I hope the authors consider integrated contribution to existing open-source tools.

The MD plot was integrated into the ggplot2 syntax, as specified in the methods section:

“The MD plot can be applied by installing the R package ‘DataVisualizations’ from CRAN (36); it operates in the framework of ggplot2 (39).”

And in the conclusion section we changed

“The MD plot is available in the R-package ‘DataVisualizations’ on CRAN (33) and in the Python package ‘md_plot’ on PyPi (34). “

To

“The MD plot is available in the R package ‘DataVisualizations’ on CRAN (36), integrated into the framework of ggplot2, and in the Python package ‘md_plot’ on PyPI (37).”

Our co-author will try to contact the seaborn developers regarding this idea if our work is published in a peer-reviewed journal.

The paper would be improved by a spelling check, and a grammar proofread by a native English speaker. Typos and grammatical oversights do not obstruct communication, but do inhibit narrative flow.

The non-native authors apologize for the inconvenience. We have now applied Grammarly for spell checking. If accepted, Springer Nature could be paid for grammatical corrections and spell checking.

The reviews led to the following decision by the handling editor, Qichun Zhang:

PONE-D-19-19081

Analyzing the Fine Structure of Distributions

PLOS ONE

Dear Dr. rer. nat. Thrun,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we have decided that your manuscript does not meet our criteria for publication and must therefore be rejected.

I am sorry that we cannot be more positive on this occasion, but hope that you appreciate the reasons for this decision.

Yours sincerely,

Qichun Zhang, PhD

Academic Editor

PLOS ONE

Additional Editor Comments (if provided):

Two reviewers returned critical comments focusing on the novelty of the manuscript. Basically, the authors redo some existing methods using Python, where the new features have not been demonstrated clearly. In addition, the English writing should be pre-checked, as some typos would affect the readability of the manuscript.

The authors have shown in this letter that:

- the first reviewer reviewed our manuscript positively, and all of this reviewer's suggestions were applied throughout the manuscript;

- the second reviewer, as well as apparently the handling editor, did not read the manuscript at all.

It seems that the handling editor merely copied and pasted the evidently false statements of the second reviewer, ignoring the first reviewer's comments except for the last one regarding spelling.

Finding out that these statements are false would take less than an hour of work. Any editor should at least be curious about a review consisting of four sentences after a long waiting period of five months and should check the claims of such a review.

Additionally, minor spelling and grammatical errors should not influence the decision of an editor in a scientific journal at all.

Attachment

Submitted filename: Rebuttal_V2.docx

Decision Letter 1

Fatemeh Vafaee

21 Apr 2020

PONE-D-19-19081R1

Analyzing the Fine Structure of Distributions

PLOS ONE

Dear Dr. rer. nat. Thrun,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

We would appreciate receiving your revised manuscript by Jun 05 2020 11:59PM. When you are ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter.

To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). This letter should be uploaded as separate file and labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. This file should be uploaded as separate file and labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. This file should be uploaded as separate file and labeled 'Manuscript'.

Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

We look forward to receiving your revised manuscript.

Kind regards,

Dr Fatemeh Vafaee and Dr David Mayerich

Academic Editors

PLOS ONE

Journal requirements:

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

http://www.journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and http://www.journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Thank you for stating the following financial disclosure:

'No'

a) Please provide an amended Funding Statement that declares *all* the funding or sources of support received during this specific study (whether external or internal to your organization) as detailed online in our guide for authors at http://journals.plos.org/plosone/s/submit-now.

b) Please state what role the funders took in the study. If any authors received a salary from any of your funders, please state which authors and which funder. If the funders had no role, please state: "The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript."

c) Please include your amended statements within your cover letter; we will change the online submission form on your behalf.

3. Thank you for stating the following in your Competing Interests section:

'No'

a. Please update your Competing Interests statement to state any Competing Interests. If you have no competing interests, please state "The authors have declared that no competing interests exist.", as detailed online in our guide for authors at http://journals.plos.org/plosone/s/submit-now

b. This information should be included in your cover letter; we will change the online submission form on your behalf. Please know it is PLOS ONE policy for corresponding authors to declare, on behalf of all authors, all potential competing interests for the purposes of transparency. PLOS defines a competing interest as anything that interferes with, or could reasonably be perceived as interfering with, the full and objective presentation, peer review, editorial decision-making, or publication of research or non-research articles submitted to one of the journals. Competing interests can be financial or non-financial, professional, or personal. Competing interests can arise in relationship to an organization or another person. Please follow this link to our website for more details on competing interests: http://journals.plos.org/plosone/s/competing-interests

4. Please ensure that you refer to Figure 7 in your text as, if accepted, production will need this reference to link the reader to the figure.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: (No Response)

Reviewer #4: (No Response)

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #4: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #4: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #4: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #4: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The authors have made significant improvements to the manuscript in accordance with reviewer comments. These improvements include the addition of more exemplary data mining applications, new visualization methods with thorough comparative assessments, and careful clarification of novel scientific contribution (which was present in the initial draft, but has been more explicitly declared in the revision).

There are a few minor formatting and language changes needed before publication:

Page 6: “visualizing the b of the estimated probability density distribution (pdf) which will be called in short the distribution of the variable”: If ‘b’ is a variable, it is recommended that it be italicized to avoid confusion.

Grammarly unfortunately doesn’t catch ‘atomic typos’, so be on the lookout for those in final revision. For example, p.11: “The bean plot has a mayor limitation” → “major”, p. 19 acknowledgements: “web scrapping” → ‘web scraping’. Note also p.12: "distrubition” → ‘distribution’.

Please add:

- label to y-axis in Fig. 1a

- y-axis values to Fig. 20, 21

- titles to Fig. 13-18, 25-30

Fig. 10b particularly aids reader understanding of distributional differences, and this reviewer is appreciative of its addition, along with other ridgeline plots and the accompanying assessment of their merits.

The authors’ thoughtful and comprehensive revision of this paper merits its publication.

A note to the editor: It would aid ease of reading for figures and their captions to be included in-context within the paper, with paper body wrapped around. If PLOS intends to increasingly publish content related to data visualization (which would be in the scientific interest), this is a recommended amendment to the paper layout criteria.

Reviewer #4: General comments:

The authors introduce the Mirrored Density plot as a method to automate the visualisation of univariate densities, with a focus on the case where many features from the same dataset need to be visualised. The authors rightly point out that in a situation where the distributions of many variables need to be inspected as part of an exploratory analysis it is crucial that visualisation tools provide robust defaults that avoid producing misleading plots for a wide variety of distributions.

At the core of this manuscript is the authors’ argument that their MD plot, which uses Pareto Density Estimation to obtain a density estimate, is superior to other commonly used visualisations, like the ridgeline, violin, or bean plot. While the argument is generally well presented and the authors offer several examples that are well suited to illuminate the differences between the various visualisation techniques, there is a key point the authors appear to be missing. The process of visualising the distribution of a univariate variable consists of two main steps, density estimation and visualisation. The authors make a convincing argument that PDE is better suited to the task than other commonly used methods, as it doesn’t rely on the user choosing appropriate parameters. However, the authors conflate the issue of density estimation with the visualisation by equating different visualisation approaches with the default estimation techniques offered by the implementations used in the comparison.

The fact that PDE is well suited to the task isn’t particularly surprising but making it readily available for data visualisations is indeed a useful contribution. Considering that the primary contribution of this manuscript is relating to data visualisation (rather than density estimation) I am surprised that they do not offer a more systematic discussion of the relative merits of the different visualisation methods included in the comparison. As I see it the major features that distinguish these plots are

1. Horizontal vs vertical display

2. Presence or absence of a rug

3. Whether density estimates are displayed beyond the range of the data

4. Whether the density estimate is mirrored to create a symmetric display

The authors have chosen a particular combination of these features but do not articulate clearly why they believe this to be desirable nor do they provide any evidence that this particular visualisation (as opposed to density estimation) is superior to others. In fact, it seems to me that the MD plot is essentially a violin plot with different default density estimation.

Detailed comments:

1. The introduction contains several references to histograms that seem less relevant now that histograms have been replaced by ridgeline plots for the purpose of the comparison. It would be helpful to shift the focus to ridgeline plots earlier. On pages 3 and 8, it is stated that the comparison will include histograms but there is no mention of the ridgeline plot.

2. I agree that the naming of the plot has potential for confusion with the existing plot of a similar name. The authors may wish to consider whether an alternative name would suit them better. I would, however, discourage arguments about which of the two methods is more appropriately named in the manuscript.

3. The quality of written English in the manuscript is generally acceptable but could be improved in a few places. I would encourage the authors to follow through on their plan to obtain assistance in editing the manuscript prior to publication.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Jane L. Adams

Reviewer #4: Yes: Peter Humburg

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files to be viewed.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2020 Oct 14;15(10):e0238835. doi: 10.1371/journal.pone.0238835.r004

Author response to Decision Letter 1


20 May 2020

Dear Editors Dr Fatemeh Vafaee and Dr David Mayerich,

Thank you for handling our manuscript entitled “Analyzing the Fine Structure of Distributions” and for giving us the chance to modify it in order to accommodate the reviewers’ and editor’s comments.

We have addressed the comments as detailed in the following (our responses appear below the reviewers’ comments in red letters).

PONE-D-19-19081R1

Analyzing the Fine Structure of Distributions

PLOS ONE


Journal requirements:

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

http://www.journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and http://www.journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Thank you for stating the following financial disclosure:

'No'

a) Please provide an amended Funding Statement that declares *all* the funding or sources of support received during this specific study (whether external or internal to your organization) as detailed online in our guide for authors at http://journals.plos.org/plosone/s/submit-now.

b) Please state what role the funders took in the study. If any authors received a salary from any of your funders, please state which authors and which funder. If the funders had no role, please state: "The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript."

c) Please include your amended statements within your cover letter; we will change the online submission form on your behalf.

The authors received no specific funding for this work.

3. Thank you for stating the following in your Competing Interests section:

'No'

a. Please update your Competing Interests statement to state any Competing Interests. If you have no competing interests, please state "The authors have declared that no competing interests exist.", as detailed online in our guide for authors at http://journals.plos.org/plosone/s/submit-now

b. This information should be included in your cover letter; we will change the online submission form on your behalf. Please know it is PLOS ONE policy for corresponding authors to declare, on behalf of all authors, all potential competing interests for the purposes of transparency. PLOS defines a competing interest as anything that interferes with, or could reasonably be perceived as interfering with, the full and objective presentation, peer review, editorial decision-making, or publication of research or non-research articles submitted to one of the journals. Competing interests can be financial or non-financial, professional, or personal. Competing interests can arise in relationship to an organization or another person. Please follow this link to our website for more details on competing interests: http://journals.plos.org/plosone/s/competing-interests

The authors state hereby that they have no competing interests.

4. Please ensure that you refer to Figure 7 in your text as, if accepted, production will need this reference to link the reader to the figure.

The manuscript was reformatted according to the suggestions in point 1; citations now use the PLOS EndNote style, and the referencing of the figures has been corrected.

Reviewers' comments:


Reviewer #1: The authors have made significant improvements to the manuscript in accordance with reviewer comments. These improvements include the addition of more exemplary data mining applications, new visualization methods with thorough comparative assessments, and careful clarification of novel scientific contribution (which was present in the initial draft, but has been more explicitly declared in the revision).

There are a few minor formatting and language changes needed before publication:

1)

Page 6: “visualizing the b of the estimated probability density distribution (pdf) which will be called in short the distribution of the variable”: If ‘b’ is a variable, it is recommended that it be italicized to avoid confusion.

Thanks, this was corrected to:

“This work concentrates on visualizing the estimated probability density distribution (PDF) which will be called the distribution of the variable”

Grammarly unfortunately doesn’t catch ‘atomic typos’, so be on the lookout for those in final revision. For example, p.11: “The bean plot has a mayor limitation” → “major”, p. 19 acknowledgements: “web scrapping” → ‘web scraping’. Note also p.12: "distrubition” → ‘distribution’.

Thank you for the valid hint. The noted errors were corrected, and professional editing was obtained. The changes are marked using the review mode in Word.

2)

Please add:

- label to y-axis in Fig. 1a

- y-axis values to Fig. 20, 21

- titles to Fig. 13-18, 25-30

This is corrected in the revised manuscript.

3)

Fig. 10b particularly aids reader understanding of distributional differences, and this reviewer is appreciative of its addition, along with other ridgeline plots and the accompanying assessment of their merits.

The authors’ thoughtful and comprehensive revision of this paper merits its publication.

A note to the editor: It would aid ease of reading for figures and their captions to be included in-context within the paper, with paper body wrapped around. If PLOS intends to increasingly publish content related to data visualization (which would be in the scientific interest), this is a recommended amendment to the paper layout criteria.

The authors are very grateful to the reviewer for the meticulous work that made this manuscript considerably better than before.

4)

Reviewer #4: General comments:

The authors introduce the Mirrored Density plot as a method to automate the visualisation of univariate densities, with a focus on the case where many features from the same dataset need to be visualised. The authors rightly point out that in a situation where the distributions of many variables need to be inspected as part of an exploratory analysis it is crucial that visualisation tools provide robust defaults that avoid producing misleading plots for a wide variety of distributions.

At the core of this manuscript is the authors’ argument that their MD plot, which uses Pareto Density Estimation to obtain a density estimate, is superior to other commonly used visualisations, like the ridgeline, violin, or bean plot. While the argument is generally well presented and the authors offer several examples that are well suited to illuminate the differences between the various visualisation techniques, there is a key point the authors appear to be missing. The process of visualising the distribution of a univariate variable consists of two main steps: density estimation and visualisation. The authors make a convincing argument that PDE is better suited to the task than other commonly used methods, as it doesn’t rely on the user choosing appropriate parameters. However, the authors conflate the issue of density estimation with the visualisation by equating different visualisation approaches with the default estimation techniques offered by the implementations used in the comparison.

The authors are grateful for this valid remark, which demonstrates that different views on this matter exist. The following paragraph was added to the discussion:

“Usually, density estimation and visualization approaches are investigated separately from each other. Instead, the authors conflate the issue of density estimation with visualization, following the perspective of Tufte, Wilk and Tukey that a graphical representation itself can be used as an instrument for reasoning about quantitative information [8, 53] (p. 53).”

The following sentence was added to the introduction:

“On the other hand, “wisely used, graphical representations can be extremely effective in making large amounts of certain kinds of numerical information rapidly available to people” [8], p. 375.”

Please note that in the preceding sentence, “Moreover” was changed to “On the one hand”. One sentence was added to the conclusion:

“On the other hand, the effective laying open of the data to display the unanticipated is a major portion of data analysis [8], p. 371.”

Please note that “On the one hand” was added to the preceding sentence. The authors hope that these changes indicate that different views on this matter are acceptable.

5)

The fact that PDE is well suited to the task isn’t particularly surprising but making it readily available for data visualisations is indeed a useful contribution. Considering that the primary contribution of this manuscript relates to data visualisation (rather than density estimation) I am surprised that they do not offer a more systematic discussion of the relative merits of the different visualisation methods included in the comparison.

As I see it the major features that distinguish these plots are

1. Horizontal vs vertical display

The authors wish to thank the reviewer for the chance to improve the manuscript significantly. The discussion states now:

Literature suggests that schematic plots should be wider than they are tall because such shapes usually make it easier for the eye to move from left to right [2] (p. 129). Small multiples of this type of schematic plot usually present several features with the same graphical design structure at once. Tufte suggests that “If the nature of the data suggests the shape of the graphic, follow the suggestion” [52]. Therefore, in the opinion of the authors, the vertical display of box plots [2] should be preferred to their horizontal counterpart, the range bar [6], and other schematic plots such as violin plots [3] should likewise be displayed vertically.

6)

2. Presence or absence of a rug

The discussion states now:

Kampstra proposed adding a rug (1D scatter plot) to the violin plot in the bean plot [4]. On the one hand, plotting points in a marginal distribution can easily be misleading [53] (Fig. 1), and the general recommendation is that “the number of information-carrying dimensions (variable) depicted should not exceed the number of dimensions in data” [52] (p.71). On the other hand, if only a handful of unique values are present in the data, then density estimation is inappropriate. Thus, the MD plot does not overlay the density estimation with the 1D scatter plot. Instead, it switches automatically to 1D jittered scatter plots if density estimation results in one or more Dirac delta distributions (e.g., SI F, Fig. 31). The scatter plots are jittered, allowing for a minor indication of the amount of data having one unique value.

7)

3. Whether density estimates are displayed beyond the range of the data

The discussion states now:

The general recommendation is that “the larger the share of graphics ink devoted to data, the better, other relevant matters being equal” [52] (p. 96). Tukey and Wilk suggest avoiding undue complexity of form in summarizing and displaying [8], p. 377. Tufte strongly argues to “erase non-data-ink within reason” [52] (p. 96). Hence, the tails of violin-like schematic plots should never extend past the range of the data.

8)

4. Whether the density estimate is mirrored to create a symmetric display

Again the authors are very grateful for this valid point. The discussion now states:

“One of the key factors of graphical integrity is to show data variation and not design variation [52]. The schematic plots investigated here have the goal of visualizing such variation by density estimation. Nonsymmetric displays are more useful in the specific task of comparing pairs of distributions to each other. Although bilateral symmetry doubles the space consumed in a graphic without adding new information, redundancy can give context and order to complexity, facilitating comparisons over various parts of data [52] (p.98). The MD plot has the goal of making it easy to compare PDFs, which are often complex. It follows that by using a symmetrical display, clipping, skewness and multimodalities are better visible in data in contrast to nonsymmetrical displays if the body of the symmetric line defined by density estimation is filled out”.

9)

The authors have chosen a particular combination of these features but do not articulate clearly why they believe this to be desirable nor do they provide any evidence that this particular visualisation (as opposed to density estimation) is superior to others.

In fact, it seems to me that the MD plot is essentially a violin plot with different default density estimation.

The authors agree with the reviewer that this issue could be elaborated further in the revised manuscript. Two features of the MD plot were not mentioned in the second draft of the manuscript because the authors thought that a discussion of such thresholds was out of the scope of this manuscript. The revised manuscript now mentions these features in the methods section:

“The MD plot performs no density estimation below a threshold defining the minimal number of unique values in the data. Instead, a 1D scatter plot (rug plot) is visualized in which, for each unique value, the points are jittered on the horizontal (y-)axis to indicate the number of points per unique value. Another threshold defines the minimal number of values in the data below which a 1D scatter plot is presented instead of a density estimation. The default settings of both thresholds can be changed or disabled by the user if necessary. These thresholds are advantageous in the case of a varying amount of missing data per feature or if the benchmarking of algorithms yields quantized error states in specific cases (SI F, Fig. 31).”
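The quoted fallback rule can be sketched in a few lines. This is a minimal illustration with assumed threshold values and an assumed function name, not the actual DataVisualizations implementation:

```python
import numpy as np

def choose_representation(x, min_unique=12, min_n=50):
    """Decide between density estimation and a jittered 1D scatter plot.

    Hypothetical sketch of the fallback rule described in the text:
    below a minimal number of unique values, or a minimal number of
    values overall, density estimation is skipped in favour of a
    jittered scatter plot. The threshold defaults are illustrative,
    not the package defaults.
    """
    x = np.asarray(x, dtype=float)
    x = x[~np.isnan(x)]  # ignore missing data per feature
    if len(np.unique(x)) < min_unique or len(x) < min_n:
        return "jittered_scatter"
    return "density_estimate"

# Quantized error states (few unique values) trigger the scatter fallback
print(choose_representation([0.1] * 40 + [0.5] * 60))
# A continuous sample with enough data is routed to density estimation
rng = np.random.default_rng(0)
print(choose_representation(rng.normal(size=500)))
```

The sketch only shows the decision logic that routes a feature to a density estimate or a jittered scatter plot; in the package itself, the defaults can be changed or disabled by the user.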

To the discussion now the following statements are added:

“In addition to the simple density estimation of several features at once, the MD plot offers several advantages. 1D scatter plots below a threshold proved very helpful for the benchmarking of clustering algorithms because, in several cases, the performance evaluation yielded discrete states (see SI F, Fig. 31). To the knowledge of the authors, this has yet to be reported in the literature. The MD plot allows us to investigate distributions after common transformations, such as robust normalization, and the overlaying of distributions with robustly estimated Gaussians. The usage of transformations is often astonishingly effective [8], p. 376. For example, using the robust transformation in combination with this type of overlaying increased the sensitivity to the tendency that a dataset possesses cluster structures compared to the usual statistical testing of the 1st principal component [54]. Wilk and Tukey argued to “plot the results of analysis as a routine matter” [8], p. 380, for which the MD plot can be a useful tool. For example, ordering features by distribution shapes proved to be helpful if the performance of classifiers is evaluated by cross-validation [55]. If these advantages are combined with the ggplot2 syntax, they provide detailed error probability comparisons [56] with a high data-to-ink ratio (cf. [52], p. 96).”

And a new figure is provided in the supplementary SI F:

“SI F: Exemplary Benchmarking of Cluster Algorithms

“Fig. 31: MD plot of the error rates for ten clustering methods, shown for the example of the Lsun3D dataset [58]. It is clearly visible that KM and FKM have two quantized error states, contrary to PBC, ProClus and Orclus, for which a density has to be estimated. The other methods can only be described by a Dirac delta distribution, indicated by a line. The methods KM and KM-ID12 differ in the initialization procedure. Abbreviations: KM (k-means), KM-ID12 (specific initialization procedure), RKM (reduced k-means), FKM (factorial k-means), PPC (projection pursuit clustering) with either MinimumDensity (MD), MaximumClusterability (MC) or NormalisedCut (NC).”

8)

Detailed comments:

1. The introduction contains several references to histograms that seem less relevant now that histograms have been replaced by ridgeline plots for the purpose of the comparison. It would be helpful to shift the focus to ridgeline plots earlier.

Thanks for the remark. Ridgeline plots are now added in the introduction. For completeness, range bars and notched box plots are now also briefly mentioned:

“If the goal is to evaluate many features simultaneously, four approaches are of particular interest: the Box-Whisker diagram (box plot) [2], the violin plot [3], the bean plot [4] and the ridgeline plot [5]. The counterparts of the box plot are the range bar [6], and its extension to the notched box plot [7] is nearly unable to visualize multimodality [2]; therefore, it is disregarded in this work.”

On pages 3 and 8, it is stated that the comparison will include histograms but there is no mention of the ridgeline plot.

This is corrected now.

9)

2. I agree that the naming of the plot has potential for confusion with the existing plot of a similar name. The authors may wish to consider whether an alternative name would suit them better. I would, however, discourage arguments about which of the two methods is more appropriately named in the manuscript.

The authors follow the suggestion of the reviewer and deleted the following paragraph:

“It should be noted that there exists an ambiguity in the naming because of the existence of “Mirror Density” plots, a graphical representation of bivariate distributions (e.g., https://www.d3-graph-gallery.com/graph/density_mirror.html). However, maximum likelihood plots can be more informative for such a use case (e.g. (43)). In the opinion of the authors, the name “Mirrored Density” plot (MD plot) is more specific than the “Mirror Density” plot because the density estimation is univariate with a graphical representation as a “line” with a symmetrical reflection of the same information filled out with a color instead of using a bivariate density estimation which does not mirror the “line” of the plot.“

However, the review process took a long time due to issues for which neither the reviewers, the current editors, nor the authors were responsible. In the time that has passed, several high-ranking publications have been published or are in the process of being published in which the plot, under its current name, is used and essential (e.g. see 9). Therefore, renaming the plot is infeasible.

10)

3. The quality of written English in the manuscript is generally acceptable but could be improved in a few places. I would encourage the authors to follow through on their plan to obtain assistance in editing the manuscript prior to publication.

The authors followed the suggestion of the reviewer and obtained the language-editing service of Springer Nature. These changes are marked using track changes in Word.

________________________________________

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Jane L. Adams

Reviewer #4: Yes: Peter Humburg

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files to be viewed.]

We hope that the manuscript now meets the criteria for publication in PLOS ONE. We are very much looking forward to hearing from you.

Yours sincerely

Alfred Ultsch, Tino Gehlert and Michael Thrun

Decision Letter 2

Fatemeh Vafaee

14 Jul 2020

PONE-D-19-19081R2

Analyzing the Fine Structure of Distributions

PLOS ONE

Dear Dr. Thrun,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. We apologize for the delay in getting back to you as we had difficulty in finding enough reviewers for the revised version of your manuscript. Not all the initial reviewers were available to review the revised version and finding a reviewer with relevant expertise who would accept to review has taken a long time.

Nonetheless, we invite you to submit a revised version of the manuscript that addresses the points raised by Reviewer #5. Please submit your revised manuscript by Aug 28 2020 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

We look forward to receiving your revised manuscript.

Kind regards,

Fatemeh Vafaee, Ph.D.

Academic Editor

PLOS ONE

[Note: HTML markup is below. Please do not edit.]

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #4: All comments have been addressed

Reviewer #5: (No Response)

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #4: Yes

Reviewer #5: Partly

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #4: Yes

Reviewer #5: N/A

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #4: Yes

Reviewer #5: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #4: Yes

Reviewer #5: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #4: (No Response)

Reviewer #5: The authors present a variant of the violin plot, termed “mirrored density” plot, which is intended to provide users with a more useful depiction of the underlying univariate distribution for the purposes of data exploration. The authors correctly highlight the fact that the default parameters for many popular packages may not be suitable for data exploration purposes as they are not sensitive enough to the fine structure of the data. The mirrored density plot is proposed to address this shortcoming of existing visualization software by utilizing Pareto Density Estimation for the estimation of univariate probability densities.

In order to argue for the adoption of mirrored density plots, the authors present a series of experiments on both simulated and real datasets in which mirrored density plots are compared to violin, ridgeline, and bean plots. Statistical tests were performed on simulated datasets to test for the presence of certain assumed/designed features (i.e bimodality and/or skewness), and plots were qualitatively inspected for agreement with these tests.

I commend the authors for reproducing their work in Python in addition to R to ensure that performance is not implementation dependent. Reproducibility is an important concern, and it is good to see the authors taking measures to ensure consistency.

There are several issues that I feel need to be addressed:

1) It is unclear why vioplot was used as the representative package for violin plots. ggplot2 is more widely used and accepted within the R community (155K monthly downloads vs 9K – although ggplot does a lot more than violin plots to be fair).

Furthermore, the underlying functionality for MDplot is provided by ggplot2’s violin plot (https://github.com/Mthrun/DataVisualizations/blob/bc76a8c6dc737cb5c593479a534ef2a5b60b330e/R/ClassMDplot.R#L148), so it seems strange not to use this package for comparison.

Please see Replication_Exp1_Fig3.svg. This figure shows the mirrored density plot overlayed with two violin plots from ggplot2. The green outline was produced with default parameters, and the red line with the minor adjustments which will be described below. When using ggplot2’s violin plot, multimodality is clearly visible when the second mean is 2.4 or 2.5 unlike the plots produced by vioplot.

Please see Replication_Uniform_Fig1.svg. This is a replication of the 1000 uniform samples figure. Again, 2 violin plots are presented on top of the MD plot. The green line is with default parameters. The red line, which almost exactly matches the MD plot, uses kernel=’rectangular’, and adjust=’0.8’. I understand that there is an argument for providing useful default parameters, but I am not convinced it warrants an entirely new package.
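The reviewer's rectangular-kernel observation can be illustrated in outline with a plain boxcar density estimate. The bandwidth, grid, and sample size below are illustrative assumptions, not ggplot2's defaults; the sketch merely shows why a rectangular kernel stays near-flat on uniformly distributed data:

```python
import numpy as np

def rect_kde(sample, grid, bandwidth):
    """Rectangular (boxcar) kernel density estimate.

    A plain sketch of the kernel='rectangular' idea from the review:
    each data point spreads mass 1/(2*bandwidth) uniformly over the
    interval [x_i - bandwidth, x_i + bandwidth].
    """
    sample = np.asarray(sample, dtype=float)
    # Count, for every grid point, how many data points fall within the window
    hits = np.abs(grid[:, None] - sample[None, :]) <= bandwidth
    return hits.sum(axis=1) / (len(sample) * 2.0 * bandwidth)

rng = np.random.default_rng(0)
sample = rng.uniform(0.0, 1.0, size=1000)  # 1000 uniform samples, as in the figure
grid = np.linspace(0.1, 0.9, 81)           # interior of the support, away from edges
density = rect_kde(sample, grid, bandwidth=0.05)
# On the interior, the estimate should hover near the true uniform density of 1
print(round(float(density.mean()), 2))
```

A Gaussian kernel with a default bandwidth would instead round off the edges of the uniform density, which is the effect the default-parameter plots in the comparison exhibit.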

The use of vioplot instead of ggplot2 is largely responsible for the authors’ claim that “Violin plots in R were not able to visualize the bimodality, which was surprising.”

2) “Statistical testing indicated that the ridgeline plot, bean plot, and MD plot have a similar sensitivity regarding bimodality and skewness as long as the sample is large enough.” – This is a gross misrepresentation of the statistical testing performed in this work. The statistical tests referred to were intended to test for the presence of bi-modality or skewness in the simulated datasets. They do not assess the performance of plotting methods. As such, this should not be taken as statistical evidence supporting MDplot’s performance. This paper is ultimately a qualitative comparison of methods and should be treated as such.

3) There is no justification/discussion regarding sample sizes in the simulated datasets which seem to have been chosen arbitrarily. Why were 1000 samples included for the uniform example, 15500 for multi-modality, and 15000 for skewness?

It would also be valuable to see how each method performs at various sample-sizes as not all data exploration takes place with such a large sample size. The smaller/real dataset experiments do not address this question as the “ground-truth” behind the structure is ultimately unknown.

4) There is no discussion surrounding limitations/shortcomings of the work. It is important to provide this information for potential users so they can make a well-informed decision about whether this package is appropriate for their data. I strongly recommend a discussion surrounding the shortcomings of the qualitative nature of this work. Quantitative comparisons are possible for this sort of work – for example, blind-surveys could be conducted to see whether individuals can detect underlying structure from the plots alone (or whether they detect structure which is not there). Furthermore, there is no discussion surrounding the tendency of this method to over-fit to the data.

Minor corrections:

• Throughout the manuscript, both in-text and in-figures, when referring to a normal distribution, m and sd should be replaced with µ and σ respectively (e.g Fig 3b).

• The authors should attempt to install their package (DataVisualizations) on a clean installation of R. It does not properly install the required packages. These packages have to be added manually.

• It would be nice to have figures either superimposing the MD/violin/bean plots or showing them side by side for an easy visual comparison.

• When referring to the Skewed normal distribution, you should use SN, and not N, to avoid confusion with the actual normal distribution (e.g fig 5b).

• Plots should be formatted consistently. For example, titles of some are bolded (5b) while the others are not (5a).

• Fig 6a uses the naming “beanplot” whereas the rest of the paper uses “bean plot”

• In “Given a feature in the data space, there are several approaches for evaluating univariate structures using the indications of the quantity and range of values, e.g., quantile-quantile plots” – e.g. should just be spelled out as “for example”

• “The counterparts of the box plot are the range bar [7], and its extension to the notched box plot [8] is nearly unable to visualize multimodality [3]; therefore, it is disregarded in this work.” – I think you mean to say “The boxplot and its counterparts (i.e. range bar and notched box) are unable to visualize multimodality and are therefore disregarded in this work.”, but I’m not sure.

• The following quote is missing an ending quote: “Pareto density estimation (PDE), the radius for hypersphere density estimation is chosen optimally w.r.t information theoretic ideas [28].

• W.r.t (with respect to) above should probably be spelled out for clarity. Square brackets can be used to indicate a quote has been altered.

• There are several places where the phrase “ridgeline plot, violin plot and bean plot” is used. I would suggest changing to “ridgeline, violin, and bean plots” for brevity. Should this suggestion be ignored, then “Although the Python ridgeline and the violin plot use density estimators implemented in different packages” should be made consistent

• Plot in the quote above should be plots – same correction applies to “in contrast to the histogram and MD plot.”

• Remove the first comma in “The results show that the MD plot is the only schematic plot, which is appropriate for every case and does not require adjustments to its process of density estimation by various parameters” (currently it reads as if the MD is the only schematic plot, which it isn’t).

• In “Using web scraping, the information of n=269 cases was extracted.” replace was with were.

Overall, the English is good; however, there are minor typos and punctuation mistakes scattered throughout. I acknowledge that this was professionally vetted by Springer Nature for language editing, but they missed several mistakes.

Further notes regarding the similarity of “rectangle” kernel estimation to the Pareto Density Estimation approach:

• I have included another comparison of violin and md plots applied to uniformly sampled data (small_uniform_sample.svg). Again, I was able to get quite a similar result using the “rectangle” option for kernel density estimation.

• In all fairness, this required a smaller bandwidth factor (adjust was set to 0.5 instead of 0.8).

• This suggests that there may be an argument to be made in support of PDE as it does a better job showing the fine-grained structure of the data.

• Furthermore, I concede that these plots taper off towards the end which may be misleading to end users.


While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

Decision Letter 3

Fatemeh Vafaee

4 Aug 2020

PONE-D-19-19081R3

Analyzing the Fine Structure of Distributions

PLOS ONE

Dear Dr. Thrun,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Sep 18 2020 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

We look forward to receiving your revised manuscript.

Kind regards,

Fatemeh Vafaee, Ph.D.

Academic Editor

PLOS ONE

Additional Editor Comments (if provided):

Comment from Editor: I appreciate your effort in improving the manuscript as per reviewers' comments; before accepting the paper, please address minor comment raised by the Reviewer and review the manuscript for English quality making sure that there is no grammatical error and improve figures' quality whenever possible.


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #5: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #5: (No Response)

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #5: (No Response)

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #5: (No Response)

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #5: (No Response)

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #5: I thank the authors for taking the time to consider my recommendations, and I am reasonably satisfied with how they have been addressed. In particular, I am pleased to see a discussion surrounding method limitations and a revision of the interpretation of statistical testing.

Although it would arguably have been more appropriate to compare MD plots to geom_violin (instead of vioplot) in the main figures, the authors have included in-text references to SI F which does make these comparisons and noted the fact that geom_violin is capable of detecting bi-modality.

They have noted that they are happy to improve figures/grammar upon acceptance, so I will leave it to the editor to make a decision regarding this matter.

I will include one small nitpick however. The authors replaced all occurrences of "e.g" with "for example". My apologies for not being more clear with my comments. I was only suggesting that the one instance of e.g be replaced with for example as it was in-sentence. It is still appropriate (and probably preferable) to make use of "e.g" inside parenthesis (e.g here).

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2020 Oct 14;15(10):e0238835. doi: 10.1371/journal.pone.0238835.r008

Author response to Decision Letter 3


25 Aug 2020

We asked a second native speaker with appropriate specialization and professional experience to revise the grammar of our manuscript after explaining the content to them in detail, with the goal that corrections are not only grammatically correct but also preserve the intended meaning. As we had excellent experience with this approach in previous work (https://www.springer.com/gp/book/9783658205393), we hope that the English quality is now considerably improved. All figures have been recomputed and then post-processed with Adobe Photoshop. Every change is marked except in supplementary F, where, for technical reasons (an Rmarkdown script), the figures could not be post-processed and grammatical corrections are not marked.

The figures now all use the same spelling with regard to the schematic plots, the same font for titles and axes, and, as far as possible, the same font size. Figures 1, 3, 5, 6, and 8 were each combined into a single illustration in which the subfigures are labeled a), b), and so on. This was achieved via post-processing in Adobe Photoshop. The whole manuscript was revised with regard to grammar.

We are sorry that we did not understand the Reviewer correctly. “e.g.,” is now used inside parentheses. The term “for example” was sometimes revised by the native speaker to other wording and sometimes retained where there were no parentheses. Detailed changes are marked in the manuscript.

All figures besides SI F have been uploaded to PACE again and tested there.

Decision Letter 4

Fatemeh Vafaee

26 Aug 2020

Analyzing the Fine Structure of Distributions

PONE-D-19-19081R4

Dear Dr. Thrun,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Fatemeh Vafaee, Ph.D.

Academic Editor

PLOS ONE

Acceptance letter

Fatemeh Vafaee

28 Aug 2020

PONE-D-19-19081R4

Analyzing the Fine Structure of Distributions

Dear Dr. Thrun:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Fatemeh Vafaee

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 File. ITS and MTY.

    (DOCX)

    S2 File. Descriptive statistics.

    (DOCX)

    S3 File. Conventional violin plot in Python.

    (DOCX)

    S4 File. Overlayed histograms.

    (DOCX)

    S5 File. Density and ridgeline plots in Python.

    (DOCX)

    S6 File. Violin plot of ggplot2.

    (PDF)

    S7 File

    (DOCX)

    Attachment

    Submitted filename: Rebuttal_V2.docx

    Attachment

    Submitted filename: Response to Reviewers 1.docx

    Data Availability Statement

The data are attached to packages in R: https://CRAN.R-project.org/package=DataVisualizations and in Python: https://pypi.org/project/md-plot/
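The MD plot itself ships in the packages linked above. Independently of either package, the mirrored-density idea behind it can be sketched in a few lines: estimate a kernel density for a feature and reflect it around a vertical axis to obtain the violin-like outline drawn per feature. The sketch below uses a bimodal Gaussian sample of the kind used in the paper's evaluation; it relies only on SciPy's default (Scott) bandwidth and is an illustration of the principle, not the authors' implementation.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Bimodal sample: two Gaussian modes that a boxplot would not reveal.
rng = np.random.default_rng(0)
sample = np.concatenate([rng.normal(-2.0, 0.5, 500),
                         rng.normal(2.0, 0.5, 500)])

# Kernel density estimate with SciPy's default (Scott) bandwidth.
kde = gaussian_kde(sample)
grid = np.linspace(sample.min(), sample.max(), 200)
density = kde(grid)

# Mirroring the density around a vertical axis yields the outline an
# MD plot draws for each feature (e.g. with matplotlib's
# fill_betweenx(grid, -density, density)).
outline_left, outline_right = -density, density

# Both modes of the sample appear as local maxima of the estimate.
n_modes = int(np.sum((density[1:-1] > density[:-2]) &
                     (density[1:-1] > density[2:])))
```

Plotting several such mirrored outlines side by side, one per feature, gives the multi-feature view that the MD plot provides without any manual tuning of the density estimation parameters.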
