2020 Nov 18;23(12):101819. doi: 10.1016/j.isci.2020.101819
Robustness A measure of how easily outlier values distort results:
  • Average: not robust; a single strong outlier severely distorts the result

  • Median: very robust; gives good results even when almost half of all values are strong outliers
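The contrast between the two can be illustrated with a small sketch. The cell counts below are hypothetical values chosen for illustration, not data from the study, and the outlier value of 10,000 is an assumption standing in for an analytical error.

```python
from statistics import mean, median

# Hypothetical cell counts for one population across seven samples.
clean = [100, 102, 98, 101, 99, 103, 97]

# The same samples with one value replaced by a strong outlier.
one_outlier = [100, 102, 98, 101, 99, 103, 10_000]

# Three of seven values (just under half) replaced by strong outliers.
three_outliers = [100, 102, 98, 101, 10_000, 10_000, 10_000]

for label, values in [
    ("clean", clean),
    ("one outlier", one_outlier),
    ("three outliers", three_outliers),
]:
    # The mean follows the outliers; the median stays with the bulk of the data.
    print(f"{label:>15}: mean = {mean(values):8.1f}, median = {median(values):6.1f}")
```

In this sketch the median remains at 101 and 102 for the two contaminated data sets, while a single outlier already moves the mean above 1500.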

Unbalanced Describes unequal group sizes or missing values; methods that assume balanced groups will give misleading results
Positive skew Asymmetric distribution of data with more small than large values, common in flow cytometry and many other biological measures
Data pre-processing Pre-processing aims to normalize the data distribution (i.e. make it bell-shaped) by changing all values according to one or more defined mathematical equations
  • All types of pre-processing can be combined with each other

Centering and scaling
Differences in cell counts do not per se reflect biological importance; centering and scaling therefore reduce the stark differences in cell numbers between cell populations so that different populations can be compared. Both are vital for multivariate statistical methods; otherwise results will be dominated by the cell populations with the highest counts or the highest noise
  • Centering: subtraction of a constant from every value, e.g. the mean

  • Scaling: normalize the range of measured values by dividing by a constant, e.g. the standard deviation

  • Can be combined, e.g. centering by mean, scaling by standard deviation is z-scaling
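As a minimal sketch of the steps listed above, the following applies centering by the mean and scaling by the standard deviation to two hypothetical cell populations that differ a thousandfold in abundance. The function names, the example counts, and the choice of the sample standard deviation (n - 1 denominator) are assumptions made for illustration.

```python
from statistics import mean, stdev

def center(values, constant):
    """Centering: subtract a constant from every value."""
    return [v - constant for v in values]

def scale(values, constant):
    """Scaling: divide every value by a constant."""
    return [v / constant for v in values]

def z_scale(values):
    """z-scaling: center by the mean, then scale by the standard deviation."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation; an assumption of this sketch
    return scale(center(values, m), s)

# Hypothetical counts: an abundant and a rare population with the same
# relative pattern across five samples.
abundant = [52_000, 48_000, 50_000, 55_000, 45_000]
rare = [52, 48, 50, 55, 45]

z_abundant = z_scale(abundant)
z_rare = z_scale(rare)

print([round(z, 3) for z in z_abundant])
print([round(z, 3) for z in z_rare])
```

After z-scaling, both populations have a mean of zero and a standard deviation of one, so the abundant population no longer dominates purely because of its larger counts.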

Transformation
  • Convert each measured value by a specific, often nonlinear, but defined mathematical function (e.g. log10(x)) to improve the distribution

  • Normal distribution is often a prerequisite for specific statistical methods or allows use of more powerful statistical methods (Keene, 1995; van den Berg et al., 2006)
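The effect of a log10 transformation on positively skewed data can be sketched as follows. The counts are hypothetical, and the moment-based skewness coefficient is one common way to quantify asymmetry, used here as an assumption rather than as the measure applied in the study. Zero counts would need separate handling before taking logarithms, which this sketch does not cover.

```python
import math
from statistics import mean

def skewness(values):
    """Moment-based skewness; positive values indicate a long right tail."""
    n = len(values)
    m = mean(values)
    m2 = sum((v - m) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - m) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Hypothetical positively skewed counts: many small values, few large ones.
counts = [10, 12, 15, 20, 25, 30, 40, 80, 300, 1200]

# Transformation: apply the same defined function to every value.
log_counts = [math.log10(c) for c in counts]

print(f"skewness before log10: {skewness(counts):.2f}")
print(f"skewness after log10:  {skewness(log_counts):.2f}")
```

For these values the transformation reduces the skewness without removing it entirely, which is why the distribution should still be checked after transforming.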

Data contaminations Denotes all kinds of problematic values in the data, such as sample outliers, single value outliers, or missing values
Outlier A value so different from the rest that it could be, for example, an analytical error
Univariate or multivariate Univariate methods investigate each measured variable on its own (e.g. analyzing only CD3+ T cells irrespective of the 15 other cell populations), whereas multivariate methods analyze several or all measured variables at once (e.g. all 16 cell populations)
  • Univariate methods can dissect several biological factors (e.g. treatment, substrain) and their interaction in great detail, but cannot directly compare different measured variables with each other (e.g. is inflammation on a given day driven more by T or B cells?)

  • Multivariate methods allow a holistic comparison of various biological factors and their main drivers (e.g. inflammation at day 3 is strongly driven by PMN and less by CD8+ T cells in BALF; Figure 6C) but are limited in dissecting several biological factors or their interaction