Abstract
Compositional data are commonly known as multivariate observations carrying relative information. Even though the case of vector or even two-factorial compositional data (compositional tables) is already well described in the literature, there is still a need for a comprehensive approach to the analysis of multi-factorial relative-valued data. Therefore, this contribution builds around the current knowledge about compositional data a general theoretical framework for k-factorial compositional data. As a main finding it turns out that, similar to the case of compositional tables, also the multi-factorial structures can be orthogonally decomposed into an independent and several interactive parts and, moreover, a coordinate representation allowing for their separate analysis by standard analytical methods can be constructed. For the sake of simplicity, these features are explained in detail for the case of three-factorial compositions (compositional cubes), followed by an outline covering the general case. The three-dimensional structure is analyzed in depth in two practical examples, dealing with systems of spatial and time dependent compositional cubes. The methodology is implemented in the R package robCompositions.
Keywords: Analysis of independence, Compositional data, Coordinate representation, Orthogonal decomposition
Introduction
Consider a data set where the relative structure of parts is of interest. As an example, the age structure of all employees in a given country is to be analyzed. In this case, the ratios between the parts shall be considered for the analysis rather than the absolute values, which are mainly influenced by the size of the country and by other external factors. As will be shown later, this situation has already been studied extensively in the classical framework of compositional data analysis (Aitchison 1982, 1986). However, when the data structure is determined according to more than one factor, e.g. when the employment structure is studied from the perspective of both age and gender of employees, the classical theory needs to be properly adjusted in order to cope with these more complex structures. The first attempt towards this goal was the proposal of compositional tables, i.e. two-factorial compositions (Fačevicová et al 2018). Nevertheless, structures formed by three or even more factors are also likely to occur in practice. For instance, in addition to gender and age, one could be interested in analyzing the employment structure according to full-time and part-time employment. Therefore, the manuscript introduces a general framework for dealing with multi-factorial compositional data, thereby extending the concepts and developments for compositional tables.
The goal of compositional data analysis is to process data which carry relative information. This has resulted in a concise methodology with a wide range of possible applications, see, e.g., Pawlowsky-Glahn et al (2015), Filzmoser et al (2018) and references therein. A D-part composition is defined as a vector $\mathbf{x} = (x_1, \ldots, x_D)'$ with positive components (parts) $x_1, \ldots, x_D$, where the real information content is in the ratios between these parts rather than directly in the measured absolute values. In other words, compositional data describe quantitatively the relative contributions of parts to a whole. Consequently, compositional data are scale invariant and can be represented without any loss of information as observations with a prescribed sum of the parts, e.g. in proportions (sum 1) or percentages (sum 100). Accordingly, the sample space of (representations of) compositional data is traditionally considered to be the D-part simplex $\mathcal{S}^D = \{\mathbf{x} = (x_1, \ldots, x_D)' \mid x_i > 0,\ i = 1, \ldots, D;\ \sum_{i=1}^{D} x_i = \kappa\}$. Note that the constant $\kappa > 0$, representing the sum of the compositional parts, can be chosen arbitrarily, and it reduces the dimensionality of the sample space to $D-1$. Specific features of compositional data, particularly the scale invariance property, are captured by the Aitchison geometry (Pawlowsky-Glahn and Egozcue 2001; Billheimer et al 2015) with Euclidean vector space properties. For compositions $\mathbf{x}, \mathbf{y} \in \mathcal{S}^D$ and a real constant $\alpha$, these properties result from defining the operations perturbation, powering, and the Aitchison inner product,
$$\mathbf{x} \oplus \mathbf{y} = \mathcal{C}(x_1 y_1, \ldots, x_D y_D)', \qquad \alpha \odot \mathbf{x} = \mathcal{C}(x_1^{\alpha}, \ldots, x_D^{\alpha})', \qquad \langle \mathbf{x}, \mathbf{y} \rangle_A = \frac{1}{2D} \sum_{i=1}^{D} \sum_{j=1}^{D} \ln\frac{x_i}{x_j} \ln\frac{y_i}{y_j}, \tag{1}$$
respectively, where $\mathcal{C}(\cdot)$ denotes closure and constitutes the equivalence classes of compositional data, which differ only by the sum of their parts $\kappa$. As a consequence, a direct application of traditional multivariate statistical methods that rely on the Euclidean geometry in the real space (Eaton 1983) is not appropriate. Even though it would be possible to adapt them to the Aitchison geometry, it is more sensible to express compositional data isometrically in the $(D-1)$-dimensional real space and proceed there, just by taking into account the specific interpretation of the new variables. In the compositional data analysis context this refers to isometric log-ratio (ilr) coordinates (Egozcue et al 2003), which are orthonormal with respect to the Aitchison geometry. The main idea is to find a system of orthonormal basis vectors $\mathbf{e}_1, \ldots, \mathbf{e}_{D-1}$ of $\mathcal{S}^D$, where the new coordinates $\mathbf{z} = (z_1, \ldots, z_{D-1})'$ are obtained as
$$z_i = \langle \mathbf{x}, \mathbf{e}_i \rangle_A, \qquad i = 1, \ldots, D-1. \tag{2}$$
Since there does not exist a canonical basis on $\mathcal{S}^D$, an option is to use such an ilr coordinate system which has an advantageous interpretation under the given problem setting. From the definition, any ilr coordinate is a log-contrast, i.e. a linear combination $\sum_{i=1}^{D} c_i \ln x_i$ with $\sum_{i=1}^{D} c_i = 0$. One popular approach for the construction of orthonormal coordinates was defined in Egozcue and Pawlowsky-Glahn (2005). The aim is to construct a sequence of binary partitions of groups of compositional parts in order to obtain coordinates that are interpretable in terms of balances between these groups of parts. Accordingly, sequential binary partitions (SBP) are based on a systematic splitting of the compositional vector into two non-overlapping subcompositions, and the generating process ends after $D-1$ steps when each subcomposition is formed by only one part. The i-th step of the partition produces one vector $\mathbf{v}_i$ of log-contrast coefficients with entries $\sqrt{s/(r(r+s))}$ at the positions corresponding to the $r$ parts from the first subcomposition formed by this step (denoted with +), entries $-\sqrt{r/(s(r+s))}$ at the positions related to the $s$ parts from the second subcomposition (denoted with −), and 0 elsewhere. The coefficients are closely linked to basis vectors through the relation $\mathbf{e}_i = \mathcal{C}(\exp(\mathbf{v}_i))$, and the resulting coordinates can be obtained from Eq. (2) or, without the need of enumeration of $\mathbf{e}_i$, directly as the log-contrast
$$z_i = \mathbf{v}_i' \ln(\mathbf{x}), \qquad i = 1, \ldots, D-1, \tag{3}$$
or
$$\mathbf{z} = \mathbf{V} \ln(\mathbf{x}), \tag{4}$$
where the contrast matrix $\mathbf{V}$ of order $(D-1) \times D$ has rows formed by $\mathbf{v}_i'$. With this notation it is possible to express the ilr coordinates (balances) by
$$z_i = \sqrt{\frac{rs}{r+s}} \ln \frac{g(x_{p_1}, \ldots, x_{p_r})}{g(x_{q_1}, \ldots, x_{q_s})}, \qquad i = 1, \ldots, D-1, \tag{5}$$
where g(.) stands for the geometric mean, and the values $p_1, \ldots, p_r$ and $q_1, \ldots, q_s$ label the parts from the first and the second subcomposition, respectively. Equation (5) reveals that SBP produces coordinates in the form of log-ratios between mean representations of two groups of parts, which has led to the name “balances”. Particularly, in the case of vector compositional data, balances allow for a simple and natural interpretation, and for this reason they are frequently used in applications (Pawlowsky-Glahn et al 2015).
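The link between an SBP, its log-contrast coefficients and the resulting balances can be checked numerically. The following is a minimal sketch in plain Python, not the implementation from robCompositions; the 4-part composition, the partition and all function names are hypothetical and chosen only for illustration. It assumes the usual balance coefficients for groups of r and s parts.

```python
import math

def gmean(values):
    """Geometric mean of a list of positive numbers."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

def sbp_contrast(signs):
    """Log-contrast coefficients for one SBP step.

    signs holds +1 (first group), -1 (second group) or 0 (part not involved).
    """
    r, s = signs.count(1), signs.count(-1)
    plus = math.sqrt(s / (r * (r + s)))
    minus = -math.sqrt(r / (s * (r + s)))
    return [plus if g == 1 else minus if g == -1 else 0.0 for g in signs]

def balance(x, signs):
    """Balance between the two groups of parts selected by signs."""
    r, s = signs.count(1), signs.count(-1)
    num = gmean([xi for xi, g in zip(x, signs) if g == 1])
    den = gmean([xi for xi, g in zip(x, signs) if g == -1])
    return math.sqrt(r * s / (r + s)) * math.log(num / den)

# Hypothetical 4-part composition and SBP (not taken from the paper)
x = [0.1, 0.2, 0.3, 0.4]
sbp = [[1, 1, -1, -1], [1, -1, 0, 0], [0, 0, 1, -1]]

# Contrast matrix, coordinates as log-contrasts, and the same values via geometric means
V = [sbp_contrast(step) for step in sbp]
z = [sum(v * math.log(xi) for v, xi in zip(row, x)) for row in V]
z_balances = [balance(x, step) for step in sbp]
```

Both routes give the same coordinates, and rescaling `x` (for example to percentages) leaves `z` unchanged because every row of `V` sums to zero.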
Although balances form a flexible class of orthonormal coordinates for vector compositional data, a further challenge is to develop a coordinate representation for the case when the whole is distributed according to two or more factors. The case of two-factorial compositional data (Egozcue et al 2008, 2015), referred to as compositional tables, has been intensively studied in Fačevicová et al (2014, 2016, 2018), and a general coordinate representation has been derived, which makes it possible to decompose compositional tables into independent and interactive parts. Accordingly, the resulting ilr coordinates have the form of balances between two groups of parts (independent part) and log-odds ratios (interactive part), and they respect the dimensionality of the decomposed parts.
More specifically, compositional tables refer to a setting where the relative structure of the data is determined by two factors. Accordingly, not only the relations within each factor, but also relations between them need to be analyzed. As an example consider an employment structure in a given country, distributed according to the age of the employees and their gender. Three types of questions arise: Is the proportion of females among the employees comparable to the proportion of males? Does any of the age groups dominate the others? Does the age structure of the employees depend on their gender? The first two questions focus exclusively on one factor, suppressing the effect of the other one. The last question, on the other hand, links information from both factors together. When the within-factor structure is analyzed, the effect of the other factor can be suppressed by averaging across all its levels. This results in a standard compositional vector, and balances are then a natural way of its coordinate representation. Particularly, consider a table with parts $x_{ij}$ formed by two factors, a row factor with I levels, and a column factor with J levels. The whole information about the relations among the I levels of the row factor is preserved in coordinates of the form
$$z_i^{r} = \sqrt{\frac{rsJ}{r+s}} \ln \frac{g(x_{i_1 \bullet}, \ldots, x_{i_r \bullet})}{g(x_{i'_1 \bullet}, \ldots, x_{i'_s \bullet})}, \qquad i = 1, \ldots, I-1, \tag{6}$$
where $r$ and $s$ correspond to the respective step of the SBP performed on the levels of the row factor, i.e. they are the numbers of rows assigned to the two groups. The indices $i_1, \ldots, i_r$ and $i'_1, \ldots, i'_s$ specify the rows, and $x_{i\bullet} = \left(\prod_{j=1}^{J} x_{ij}\right)^{1/J}$ is the geometric mean of the parts in the i-th row. Similarly, relations among the J levels of the column factor are preserved in the balances
$$z_j^{c} = \sqrt{\frac{tuI}{t+u}} \ln \frac{g(x_{\bullet j_1}, \ldots, x_{\bullet j_t})}{g(x_{\bullet j'_1}, \ldots, x_{\bullet j'_u})}, \qquad j = 1, \ldots, J-1, \tag{7}$$
constructed with respect to the SBP of the levels of the column factor, where $t$ and $u$ are the numbers of columns in the two groups and the indices $j_1, \ldots, j_t$ and $j'_1, \ldots, j'_u$ specify the included columns.
The relations between two factors are traditionally described by odds ratios (Agresti 2002). This concept can be adapted to compositional tables, because the last group of coordinates has the form of log-odds ratios between four groups of parts. These groups are uniquely defined by row and column SBPs and represented by geometrical means of their parts. More specifically, these odds ratio coordinates are given by
| 8 |
for and , where are indices of parts in each group defined by the i-th and j-th step of row and column SBP, respectively, and are the numbers of parts within these groups. The construction and interpretation of this coordinate system is discussed in detail in Fačevicová et al (2018).
An important feature of compositional tables is the possibility of their orthogonal decomposition. In the special situation when there exists no relationship between row and column factors, all parts of the compositional table would be formed by the product of row and column (geometric) marginals. This leads to the so-called independence table. The orthogonal complement to the independence table is called the interaction table. Since the previously introduced coordinate system respects this decomposition, the independence table is characterized by row and column balances and the interaction table by odds ratio coordinates, so it is possible to analyze each part separately.
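The decomposition of a table into its independence and interaction parts can be sketched in a few lines. This is an illustrative sketch only: the helper names are hypothetical, and the 2×2 table (gender by type of contract for the 25-54 age group) simply borrows values from Table 1. The Aitchison inner product is computed through centred log-ratio coefficients, which is a standard equivalent form.

```python
import math

def gmean(values):
    """Geometric mean of a list of positive numbers."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

def decompose_table(x):
    """Split a table into independence and interaction tables (geometric marginals)."""
    rows, cols = len(x), len(x[0])
    row_marg = [gmean(x[i]) for i in range(rows)]
    col_marg = [gmean([x[i][j] for i in range(rows)]) for j in range(cols)]
    ind = [[row_marg[i] * col_marg[j] for j in range(cols)] for i in range(rows)]
    inter = [[x[i][j] / ind[i][j] for j in range(cols)] for i in range(rows)]
    return ind, inter

def clr(table):
    """Centred log-ratio coefficients of a table, flattened row by row."""
    logs = [math.log(v) for row in table for v in row]
    mean = sum(logs) / len(logs)
    return [value - mean for value in logs]

# Rows: female, male; columns: full-time, part-time (25-54 age group, Table 1)
table = [[1618.415, 90.505], [2127.849, 22.759]]
ind, inter = decompose_table(table)

# Aitchison inner product of the two parts; it should vanish up to rounding error
inner = sum(a * b for a, b in zip(clr(ind), clr(inter)))
```

Multiplying `ind` and `inter` cell by cell returns the original table, and the row and column geometric means of `inter` are constant, which is the uniform-marginals property of the interaction table.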
Even though the structure of compositional tables is already well described in the literature, a comprehensive approach for the analysis of multi-factorial relative-valued data is still lacking. Thus, the framework of compositional tables is extended to a general theory to work with k-factorial compositional data. Besides other findings, it turns out that also the multi-factorial structures can be decomposed orthogonally into an independent and several interactive parts and, moreover, a coordinate representation allowing for their separate analysis is provided. Although all considerations in the next section (Sect. 2) are presented just for the case of three-factorial compositional data (called compositional cubes in the following), they can be easily generalized to the case of more than three factors.
The construction and interpretation of the proposed coordinate system is explained on an illustrative example in Sects. 2.1 and 3, which also introduces the function implemented in the R package robCompositions (Templ et al 2011). The coordinates are used for the analysis of the employment structure of the European OECD countries, and a graphical comparison of the countries as well as a spatial clustering are provided. Moreover, the main sources of differences between the clusters are investigated by robust principal component analysis. Based on an example of Austrian mobility data, a strategy for the analysis of multi-factorial time series is presented in Sect. 4. The final Sect. 5 concludes.
Compositional cubes
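In this section we simplify the main findings derived in Fačevicová et al (2018) for the two-factorial situation, and consequently generalize them to the multi-factorial case. Consider a relative structure formed according to three factors with I, J and K levels, respectively. Such a situation can be represented with a compositional cube and written in the form
$$\mathbf{x} = \left(\begin{array}{ccc|c|ccc} x_{111} & \cdots & x_{1J1} & \cdots & x_{11K} & \cdots & x_{1JK} \\ \vdots & \ddots & \vdots & \cdots & \vdots & \ddots & \vdots \\ x_{I11} & \cdots & x_{IJ1} & \cdots & x_{I1K} & \cdots & x_{IJK} \end{array}\right), \tag{9}$$
where $x_{ijk} > 0$, $i = 1, \ldots, I$, $j = 1, \ldots, J$, $k = 1, \ldots, K$. The vertical lines separate the levels of the third factor, called slices in the following. Since compositional cubes form a special case of the concept of $IJK$-part vector compositional data, all basic definitions can be accommodated for this case.
The sample space of compositional cubes is a subset of the $IJK$-part simplex
$$\mathcal{S}^{IJK} = \Big\{ \mathbf{x} = (x_{ijk}) \ \Big|\ x_{ijk} > 0,\ \sum_{i=1}^{I} \sum_{j=1}^{J} \sum_{k=1}^{K} x_{ijk} = \kappa \Big\}, \tag{10}$$
which includes only those $IJK$-part compositions that can be recorded in the form of a three-factorial structure with I rows, J columns and K slices. For compositional cubes $\mathbf{x}$ and $\mathbf{y}$ and a real constant $\alpha$, the basic operations of the Aitchison geometry (as given in Pawlowsky-Glahn and Egozcue 2001) modify to
$$\mathbf{x} \oplus \mathbf{y} = \mathcal{C}\big( (x_{ijk}\, y_{ijk})_{i,j,k} \big), \qquad \alpha \odot \mathbf{x} = \mathcal{C}\big( (x_{ijk}^{\alpha})_{i,j,k} \big), \tag{11}$$
where $\mathcal{C}$ denotes closure, and they result again in a compositional cube. The Euclidean vector space structure of the Aitchison geometry is completed by defining the Aitchison inner product for cubes
$$\langle \mathbf{x}, \mathbf{y} \rangle_A = \frac{1}{2IJK} \sum_{i,j,k} \sum_{i',j',k'} \ln\frac{x_{ijk}}{x_{i'j'k'}} \ln\frac{y_{ijk}}{y_{i'j'k'}}. \tag{12}$$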
Example: employment structure
The proposed approach will be illustrated with an example where the analysis of the employment structure in several countries is of interest. For this purpose, data from 32 European members of OECD were collected at http://stats.oecd.org. For each country in the sample, an estimated number of employees in the year 2015 was available. The data were structured according to gender and age of the employees and the type of their contract. More specifically, we distinguish males (M) and females (F), young (category 15–24), middle-aged (25–54) and older (55+) employees, and full-time (FT) and part-time (PT) contracts. The data at hand thus form a sample of 32 cubes with two rows (gender), two columns (type of contract) and three slices (age), which allow for a deeper analysis of the overall employment structure, not just from the perspective of each factor separately, but also from the perspective of the relations/interactions between them. Besides the global aspects of the employment, the analysis aims also at revealing the national specifics of the countries contained in the sample. An example of one cube from the Czech Republic is displayed in Table 1, and a graphical overview of the cubes is depicted in Fig. 1.
Table 1.
Example of one cube from the sample analyzed in Sections 2.1 and 3: employment structure in the Czech Republic in 2015 (in thousands of employees)
| Gender | 15–24, FT | 15–24, PT | 25–54, FT | 25–54, PT | 55+, FT | 55+, PT |
|---|---|---|---|---|---|---|
| Female | 104.756 | 17.128 | 1618.415 | 90.505 | 317.031 | 56.355 |
| Male | 169.851 | 11.165 | 2127.849 | 22.759 | 467.212 | 38.208 |
Fig. 1.

Graphical representation of the cube structure used in Sects. 2.1 and 3. The rows represent the gender of the employees, the columns the type of contract (FT: full-time, PT: part-time), and the slices separate different age groups
Obviously, the counts in the cells of the cubes depend on the population size of the country. When the analysis of structural patterns of the employment in several countries is of interest, the compositional approach appears as appropriate, because the population size is not relevant in this approach.
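The irrelevance of the overall size can be illustrated directly with the counts from Table 1. The short sketch below is only an illustration; the function name `closure` and the flattened cell order (row by row as printed in Table 1) are choices made here, not part of the original analysis.

```python
import math

def closure(values, kappa=1.0):
    """Rescale positive values so that they sum to kappa."""
    total = sum(values)
    return [kappa * v / total for v in values]

# Table 1, row by row: female (15-24 FT, PT; 25-54 FT, PT; 55+ FT, PT), then male
counts = [104.756, 17.128, 1618.415, 90.505, 317.031, 56.355,
          169.851, 11.165, 2127.849, 22.759, 467.212, 38.208]

proportions = closure(counts)              # parts sum to 1
percentages = closure(counts, kappa=100.0)  # parts sum to 100

# Log-ratio of female to male full-time employees aged 25-54
lr_counts = math.log(counts[2] / counts[8])
lr_props = math.log(proportions[2] / proportions[8])
```

The two log-ratios coincide, so any analysis built on log-ratios gives the same result for counts in thousands, proportions or percentages.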
Decomposition of compositional cubes
Egozcue et al (2008) proposed a decomposition of a compositional table into an independent and an interactive part (still a compositional table), which are mutually orthogonal; their perturbation again leads to the original compositional table. The independent part mimics the independence of the factors. As in the standard case of contingency tables, an assumption of independence means that the whole information about the relative structure of both factors is preserved in the row and column marginals, and each entry of the table can be obtained as their product. In the compositional case, the only difference is that the arithmetic marginals are replaced by the geometric ones. When the factors are not independent, and the compositional table does not equal the independence one, another table needs to be introduced. The interaction table, simply defined as a residual resulting from the difference between the original and the independence tables in the sense of perturbation, preserves the whole information about the relations between the factors and becomes mainly important when these relations are analyzed. A similar idea can be utilized also in the case of compositional cubes, but due to the presence of pairwise and whole interactions, it is possible to further decompose the interactive part of the cube into four additional cubes, each preserving information about another source of association between the factors.
Similar to the case of compositional tables, the parts of the independence compositional cube $\mathbf{x}^{\mathrm{ind}}$ are also formed by the product of row, column and slice (geometric) marginals
$$x^{\mathrm{ind}}_{ijk} = x_{i\bullet\bullet}\, x_{\bullet j\bullet}\, x_{\bullet\bullet k}, \tag{13}$$
where dots in the index indicate an aggregation over the respective factors. For example, the notation $x_{i\bullet\bullet}$ stands for the geometric mean of all parts in the i-th row of the cube, i.e. $x_{i\bullet\bullet} = \left(\prod_{j=1}^{J} \prod_{k=1}^{K} x_{ijk}\right)^{1/(JK)}$. In case of perfect independence of all three factors, the original cube would be equal to the independence one. Otherwise, all associations between the factors are preserved in the interactive part
$$\mathbf{x}^{\mathrm{int}} = \mathbf{x} \ominus \mathbf{x}^{\mathrm{ind}}, \qquad x^{\mathrm{int}}_{ijk} = \frac{x_{ijk}}{x_{i\bullet\bullet}\, x_{\bullet j\bullet}\, x_{\bullet\bullet k}}, \tag{14}$$
where $\ominus$ denotes the perturbation difference, $\mathbf{x} \ominus \mathbf{y} = \mathbf{x} \oplus ((-1) \odot \mathbf{y})$.
As mentioned above, the interactive part can be further decomposed. First, the relations between row and column factors are analyzed. Aggregation over values of the slice factor eliminates its impact and reduces the three-dimensional structure to a system of K similar compositional tables, forming K slices of a cube. According to Egozcue et al (2008), the interactive part of a table is extracted by a division of its parts by the respective geometric marginals. These considerations result in the compositional cube $\mathbf{x}^{rc}$ with cells
$$x^{rc}_{ijk} = \frac{x_{ij\bullet}}{x_{i\bullet\bullet}\, x_{\bullet j\bullet}}. \tag{15}$$
From (15) it follows that $\mathbf{x}^{rc}$ is actually formed by K equal slices (compositional tables). Moreover, the row and column geometric marginals (that means compositional tables resulting from an aggregation of cube cells by geometric means across the respective direction) are uniform, which underlines the favorable structure of the proposed decomposition. The system of marginals is completed in the direction of slices, whose respective marginal table corresponds to (15). In order to extract the pure interaction between row and slice factor, the effect of the column factor needs to be filtered out using the geometric mean, and similarly to $\mathbf{x}^{rc}$, the row-slice interaction cube $\mathbf{x}^{rs}$ has parts
$$x^{rs}_{ijk} = \frac{x_{i\bullet k}}{x_{i\bullet\bullet}\, x_{\bullet\bullet k}}. \tag{16}$$
Similar to the case of $\mathbf{x}^{rc}$, this cube also has uniform marginals. This property holds for the row and slice directions, and the marginal table computed across the columns equals (16). Finally, interactions between column and slice factors are contained in the cube $\mathbf{x}^{cs}$ with cells
$$x^{cs}_{ijk} = \frac{x_{\bullet jk}}{x_{\bullet j\bullet}\, x_{\bullet\bullet k}}. \tag{17}$$
Since this cube is formed by I identical rows, the row marginals also equal a table with parts (17); however, the column and slice marginals are again uniform, i.e., they are composed of the same positive elements. All pairwise interaction cubes are orthogonal, but since there was always one factor omitted from the consideration, the information about the interactive part of the original cube is still not complete. The structure of a compositional cube is completed by considering mutual interactions between all three factors. This corresponds to the cube
$$\mathbf{x}^{rcs} = \mathbf{x} \ominus \mathbf{x}^{\mathrm{ind}} \ominus \mathbf{x}^{rc} \ominus \mathbf{x}^{rs} \ominus \mathbf{x}^{cs} \tag{18}$$
with parts
$$x^{rcs}_{ijk} = \frac{x_{ijk}\, x_{i\bullet\bullet}\, x_{\bullet j\bullet}\, x_{\bullet\bullet k}}{x_{ij\bullet}\, x_{i\bullet k}\, x_{\bullet jk}}. \tag{19}$$
This cube also has an advantageous structure from the perspective of the marginal tables, which are in this case uniform in all three directions. Note here that a similar property holds also for the interactive part of a compositional table, where row and column marginals are uniform by construction. Such a decomposition of the multi-factorial data can be very useful for an in-depth analysis of the data structure, as demonstrated in Fačevicová et al (2016, 2018, 2021) for compositional tables. For instance, in the example from Sect. 2.1, the row-column interaction cube $\mathbf{x}^{rc}$ preserves interactions between the gender of employees and the type of their contract. The information on relations between these two factors is then completed by the full interaction cube $\mathbf{x}^{rcs}$, which involves also the effect of age, which was suppressed in $\mathbf{x}^{rc}$. Moreover, from the decomposition it follows that for those interactions, for which the respective cubes in the interaction part are constructed, the respective marginals are uniform, i.e., the information is fully captured by these cubes and does not propagate further. Consequently, vector (one-factorial) geometric marginals occur in the decomposition only in the independent part, as expected, and higher-dimensional nontrivial marginals (here in the form of compositional tables) are left for bifactorial interaction cubes.
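The decomposition described in this section can be reproduced numerically for the Czech cube from Table 1. The sketch below is a plain-Python illustration under the assumption that each part of the decomposition is obtained from geometric marginals exactly as described in the text; the function and variable names are hypothetical and it is not the robCompositions implementation.

```python
import math

def gm_marginal(x, fix):
    """Geometric mean over all cells that match the fixed indices.

    fix is a tuple (i, j, k) with None for every aggregated factor.
    """
    logs = [math.log(x[i][j][k])
            for i in range(len(x))
            for j in range(len(x[0]))
            for k in range(len(x[0][0]))
            if all(f is None or f == idx for f, idx in zip(fix, (i, j, k)))]
    return math.exp(sum(logs) / len(logs))

def decompose_cube(x):
    """Return the independence, pairwise interaction and full interaction cubes."""
    I, J, K = len(x), len(x[0]), len(x[0][0])
    names = ("ind", "rc", "rs", "cs", "rcs")
    parts = {n: [[[0.0] * K for _ in range(J)] for _ in range(I)] for n in names}
    for i in range(I):
        for j in range(J):
            for k in range(K):
                xi = gm_marginal(x, (i, None, None))
                xj = gm_marginal(x, (None, j, None))
                xk = gm_marginal(x, (None, None, k))
                xij = gm_marginal(x, (i, j, None))
                xik = gm_marginal(x, (i, None, k))
                xjk = gm_marginal(x, (None, j, k))
                parts["ind"][i][j][k] = xi * xj * xk
                parts["rc"][i][j][k] = xij / (xi * xj)
                parts["rs"][i][j][k] = xik / (xi * xk)
                parts["cs"][i][j][k] = xjk / (xj * xk)
                parts["rcs"][i][j][k] = x[i][j][k] * xi * xj * xk / (xij * xik * xjk)
    return parts

# Table 1 as cube[gender][contract][age]: gender (F, M), contract (FT, PT),
# age (15-24, 25-54, 55+)
cube = [[[104.756, 1618.415, 317.031], [17.128, 90.505, 56.355]],
        [[169.851, 2127.849, 467.212], [11.165, 22.759, 38.208]]]
parts = decompose_cube(cube)
```

The cell-wise product of the five cubes in `parts` returns the original counts, the slices of `parts["rc"]` are identical, and the geometric marginals of `parts["rcs"]` are constant in every direction.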
On the other hand, if each part of the decomposition is considered separately, two challenges need to be taken into account. First, the dimensionality of the original cube sample space is decomposed as well. More specifically, the overall dimensionality $IJK-1$ turns to $I+J+K-3$ for the independence cube $\mathbf{x}^{\mathrm{ind}}$, $(I-1)(J-1)$ for the row-column interaction cube $\mathbf{x}^{rc}$ (and similarly for the remaining cubes related to paired interactions) and, finally, the dimensionality of the sample space of the full interaction cube $\mathbf{x}^{rcs}$ is $(I-1)(J-1)(K-1)$. The altered dimensionality, corresponding to each of the cubes from the decomposition, can cause computational problems when an arbitrary ilr coordinate system (primarily designed for vector compositional data) is used for the representation of independence and interaction cubes. For example, this can be the case for robust statistical analysis (de Sousa et al 2021), but also in general it is desirable to assign to each of the cubes from the decomposition a number of coordinates (out of the total number of them) that reflects their respective dimensionality. The second problem concerns the interpretation of the results. Even though it is usually possible to convert the results back to the simplex, it is convenient to proceed with the analysis in some well-interpretable coordinates. Obviously, balances, defined as a log-ratio between two groups of parts, are not able to capture the multi-factorial nature of the compositional cubes. Although they can help to describe the relative structure of each factor separately, for a description of interactions we need to construct some alternative coordinate system. The construction of such orthonormal coordinates is presented in the following section.
Coordinate representation of compositional cubes
In this section we will focus on a possible coordinate representation of three-factorial compositional data, compositional cubes, which simplifies substantially the construction of ilr coordinates for compositional tables proposed in Fačevicová et al (2018). A deeper understanding of the structure of this coordinate representation allows its generalization and application to compositional data describing relationships given by more than three factors. In order to keep the construction as simple as possible, we consider a vectorized version of the cube
| 20 |
As already suggested, balances can help to describe the relative structure within each factor. For this purpose, the whole rows, columns and slices (each represented by the geometric mean across all levels of the remaining factors) should be taken. After $I-1$ steps of the sequential binary partition applied on the levels of the row factor (SBPr), a system of $I-1$ vectors $\mathbf{v}^{r}_i$ (of length IJK) is obtained. The i-th generating vector has entries $\sqrt{s/(r(r+s)JK)}$ at positions corresponding to parts from the $r$ rows of the cube $\mathbf{x}$, which were in the respective step assigned to the + group, entries $-\sqrt{r/(s(r+s)JK)}$ at positions corresponding to parts from the $s$ rows assigned to the − group, and zero elsewhere. The first group of coordinates is thus simply formed by row balances, which characterize the structure of the row factor when the influence of the other factors is suppressed,
$$z^{r}_i = \sqrt{\frac{rsJK}{r+s}} \ln \frac{g(x_{i_1 \bullet\bullet}, \ldots, x_{i_r \bullet\bullet})}{g(x_{i'_1 \bullet\bullet}, \ldots, x_{i'_s \bullet\bullet})}, \qquad i = 1, \ldots, I-1. \tag{21}$$
An example can be seen in Sect. 3.1, where this first set of coordinates results in a balance between male and female employees, regardless of their age or type of the contract. A similar construction can be made for column and slice factors. In the first case, a sequential binary partition of the whole columns (SBPc) results in a system of $J-1$ vectors $\mathbf{v}^{c}_j$ with entries $\sqrt{u/(t(t+u)IK)}$, $-\sqrt{t/(u(t+u)IK)}$ and 0, always corresponding to parts of the $t$ columns from the + group, the $u$ columns from the − group, and columns not included in the respective step of SBPc. The inner structure of the column factor is preserved through column balances
$$z^{c}_j = \sqrt{\frac{tuIK}{t+u}} \ln \frac{g(x_{\bullet j_1 \bullet}, \ldots, x_{\bullet j_t \bullet})}{g(x_{\bullet j'_1 \bullet}, \ldots, x_{\bullet j'_u \bullet})}, \qquad j = 1, \ldots, J-1, \tag{22}$$
which form the second group in the coordinate system representing the whole compositional cube $\mathbf{x}$ as well as its independent part $\mathbf{x}^{\mathrm{ind}}$. The third group describes the structure of the slice factor. A sequential binary partition of the whole slices (SBPs) now determines the final system of $K-1$ vectors $\mathbf{v}^{s}_k$ with entries $\sqrt{w/(v(v+w)IJ)}$, $-\sqrt{v/(w(v+w)IJ)}$ and 0, corresponding to the $v$ slices from group +, the $w$ slices from group −, and the remaining slices not included in the k-th step, respectively; the slice balances are
$$z^{s}_k = \sqrt{\frac{vwIJ}{v+w}} \ln \frac{g(x_{\bullet\bullet k_1}, \ldots, x_{\bullet\bullet k_v})}{g(x_{\bullet\bullet k'_1}, \ldots, x_{\bullet\bullet k'_w})}, \qquad k = 1, \ldots, K-1. \tag{23}$$
Note here that the row, column and slice balances form a complete coordinate representation of $\mathbf{x}^{\mathrm{ind}}$, since their number, $I+J+K-3$, equals the dimensionality of the independence cube. All the remaining coordinates, completing the orthonormal coordinate system of $\mathbf{x}$ (e.g. those given by Eqs. (24)–(27)), are zero for $\mathbf{x}^{\mathrm{ind}}$.
When row, column and slice SBPs are defined, we can immediately construct the remaining elements of the coordinate system of the cube , which also correspond to coordinates of . For this purpose the normalized Hadamard (entry wise) product () of the vectors , and and Eq. (3) is used. The vectors , , determine coordinates of type
| 24 |
for and , capturing the interactions between row and column factors, which through the geometric mean suppress the influence of the slice factor. Obviously, these coordinates are formed by four groups of parts (denoted as A, B, C, D and represented by their respective geometric means), and they can be interpreted in terms of a log-odds ratio, which is also used in a standard statistical analysis of two-factorial data (Agresti 2002). This system of coordinates, composed into the ()-component vector with on the positions corresponding to and zeros elsewhere, thus allows for the analysis of the relations exclusively between row and column factors.
The Hadamard product of and , , , leads to coordinates
| 25 |
for and , which capture the information about the relations between row and slice factors, when the influence of the column factor is suppressed. Similar to the case of , also coordinates contained in the respective vector can be interpreted in terms of a log-odds ratio and are utilized, when the relationship between row and slice factors is of primary interest.
Finally, the Hadamard products of and , , lead to vectors and coordinates
| 26 |
for and . The system of these coordinates completes the odds ratio-type coordinates with those concerning relations between column and slice factors.
To complete the original data structure, also full interactions between all three factors need to be contained in the coordinate system. The remaining coordinates are determined by the Hadamard product of all three types of vectors, and , , , , and have the general form
| 27 |
for , , and , where
| 28 |
is a constant ensuring orthonormality of the coordinates. Even though the interpretation of this last group of coordinates may be a bit tricky (one possible interpretation is in terms of a log-ratio of two odds ratios), their definition is necessary to complete the system of orthonormal coordinates of the original table . Moreover, when a sample of compositional cubes is available, these coordinates can be used for instance to test for the presence of full interactions.
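The construction of the whole coordinate system from the three SBPs can be sketched for a cube with two rows, two columns and three slices. This is a hedged illustration only: the cell order used for vectorization, the sign conventions of the partitions and all names are assumptions made here, and the code is not the function from robCompositions. It builds the balance vectors, forms the normalized Hadamard products, and evaluates the coordinates as log-contrasts.

```python
import math
from itertools import product

SHAPE = (2, 2, 3)  # rows, columns, slices
# Assumed vectorization: cells ordered by (i, j, k) with k running fastest
CELLS = list(product(*(range(n) for n in SHAPE)))

def balance_vector(signs, axis):
    """Generating vector for one SBP step applied to the levels of one factor."""
    r, s = signs.count(1), signs.count(-1)
    m = len(CELLS) // SHAPE[axis]  # number of cells per level of this factor
    coef = {1: math.sqrt(s / (r * (r + s) * m)),
            -1: -math.sqrt(r / (s * (r + s) * m)),
            0: 0.0}
    return [coef[signs[cell[axis]]] for cell in CELLS]

def hadamard(*vectors):
    """Entry-wise product of the vectors, rescaled to unit Euclidean norm."""
    prod = [math.prod(vals) for vals in zip(*vectors)]
    norm = math.sqrt(sum(p * p for p in prod))
    return [p / norm for p in prod]

# Hypothetical SBPs: first level vs. second for rows and columns,
# first slice vs. the rest, then second slice vs. third
rows = [balance_vector([1, -1], 0)]
cols = [balance_vector([1, -1], 1)]
slices = [balance_vector([1, -1, -1], 2), balance_vector([0, 1, -1], 2)]

V = (rows + cols + slices
     + [hadamard(a, b) for a in rows for b in cols]
     + [hadamard(a, c) for a in rows for c in slices]
     + [hadamard(b, c) for b in cols for c in slices]
     + [hadamard(a, b, c) for a in rows for b in cols for c in slices])

def coordinates(x):
    """Coordinates of cube x[i][j][k] as log-contrasts with the rows of V."""
    logs = [math.log(x[i][j][k]) for i, j, k in CELLS]
    return [sum(v * l for v, l in zip(vec, logs)) for vec in V]

# Table 1 as czech[gender][contract][age]
czech = [[[104.756, 1618.415, 317.031], [17.128, 90.505, 56.355]],
         [[169.851, 2127.849, 467.212], [11.165, 22.759, 38.208]]]
z = coordinates(czech)
```

The list `V` holds 11 mutually orthonormal vectors, which matches the dimensionality 2·2·3 − 1 of the cube; the first four coordinates in `z` are the row, column and slice balances and the remaining seven describe the pairwise and full interactions.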
For an easier understanding of the coordinate structure, especially the assignment of parts into groups, Fig. 2 provides a graphical representation of each type of proposed coordinates. Moreover, the specific system of generating vectors, respective coordinates, and their interpretation is given in Sect. 3.1.
Fig. 2.
Graphical representation of groups of parts involved in each type of coordinates forming the whole coordinate system designed for compositional cubes
Besides the benefits of the proposed coordinate system in terms of interpretation, it is important to point out that the coordinates reflect the dimensionality of the sample space of the decomposed parts to which they are assigned, and thus allow to analyze these parts separately. Of course, each decomposed part is still a cube of the same dimension as the original one (with I rows, J columns and K slices) and its coordinate representation must contain components, but the structure of the vector of coordinates follows the one introduced for of the interaction cube . More specifically, e.g. the cube is represented in the proposed system with coordinates and the remaining coordinates defined in this section equal zero. Accordingly, with respect to the decomposition described in Sect. 2.2, for the coordinate representation of the original compositional cube the following relation holds,
| 29 |
Even though the interpretation of the coordinate system is determined by the initial SBPs, any other relationship within the compositional cube is reachable through a transformation matrix , whose rows are formed by coefficients of the respective logarithmized parts of in the desirable log-ratios. According to (4), the vectorized form of a compositional cube is equal (after closure) to . Therefore, a system of log-contrasts representing a given compositional cube equals
| 30 |
An example of such a transformation for the coordinates constructed in Sect. 3 is provided in Appendix.
General properties of multi-factorial compositional data
The findings from Sects. 2.2 and 2.3 can be directly extended to a general k-factorial case. k-factorial compositional data are formed by a k-dimensional array of positive entries, representing a relative structure given by the levels of k constituting factors. Such a complex structure also contains its independent and interactive parts, where the independent part is equal to the product of (vector) geometric marginals. The sources of interactions are given by the relations between pairs, triplets, quadruplets, etc. of constituting factors, and therefore the interactive part can be further orthogonally decomposed into objects carrying information about each of these sources. The main principle is based on aggregation over the redundant dimensions and expression of interactions within the resultant object. In the case of compositional cubes we have seen that pairwise interactions actually correspond to the interactive part of a compositional table formed by geometric means computed across the levels of the third factor, see e.g. Eq. (15). Similarly, in the case of a four-factorial compositional object, all sources of interactions between a selected triplet of factors can be reached by a decomposition of a cube given by an aggregation over the remaining fourth dimension. When the aggregated dimension is varied, all pairwise and three-way interactions are extracted and, finally, by subtraction of all these parts together with the independent one, the object preserving the full interactions is reached (similarly to Eq. (18)).
Section 2.3 shows that the whole coordinate representation of a compositional cube is determined by three systems of SBPs, separately given for the levels of row, column and slice factors. Similarly, also k-factorial compositions can be represented in orthonormal coordinates. Balances between levels of the individual factors characterize the independent part of the object. Log-contrasts obtained from the Hadamard product of pairs, triplets, etc. of SBP basis vectors and Eq. (3) then represent the respective sources of interactions.
Example: employment structure—continuation
Let us go back to the employment structure data set introduced in Sect. 2.1. According to the proposed methodology, each cube from the sample can be represented by a system of coordinates (Sects. 3.1 and 3.2), which allows for a deeper analysis of the structural patterns (Sect. 3.3).
Coordinate representation
Following Section 2, the row, column and slice SBPs need to be determined prior to the construction of coordinates. For the factors “type of contract” and “gender”, only their two levels need to be separated. Consequently, the first two generating vectors are
and
where the components of these vectors correspond to the cells of the vectorized form of the cube,
There are more options for the slice SBP, where the analyst can decide which age group is to be separated first. Here, the youngest group was first separated from the remaining two groups and, in the next step, the middle-aged group (25–54 years) from the oldest. The other options, starting with the separation of the middle-aged or the oldest group, respectively, would lead to similar results (in terms of presence of interaction between factors), but they would slightly alter the interpretation. In the presented case, the generating vectors are
and
Following the construction from Sect. 2, the Hadamard product of the above derived generating vectors leads (after their normalization) to the remaining system of vectors. Particularly,
| 31 |
Finally, according to Eq. (3), these vectors lead to a system of 11 orthonormal coordinates:
A graphical representation of these coordinates is provided in Fig. 3.
Fig. 3.

Graphical representation of the groups of parts involved in particular coordinates. Yellow parts constitute the numerator, green the denominator and white parts are not included in the respective log-ratio. (Color figure online)
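The construction of the coordinate system can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the assignment of factors to rows, columns and slices, the order of the cells in the vectorized cube, the dictionary keys and the cube values are all hypothetical, and Eq. (3) is assumed to be the usual log-contrast (weighted sum of log-parts).

```python
import math
from itertools import product

I, J, K = 2, 2, 3  # assumed: rows = gender, columns = contract type, slices = age group
cells = list(product(range(I), range(J), range(K)))  # assumed vectorization order

def balance_weights(signs):
    """Unit-norm, zero-sum log-contrast weights: +1 marks numerator parts, -1 denominator parts."""
    r, s = signs.count(1), signs.count(-1)
    pos = math.sqrt(s / (r * (r + s)))
    neg = -math.sqrt(r / (s * (r + s)))
    return [pos if v == 1 else neg if v == -1 else 0.0 for v in signs]

def generating_vector(level_signs, axis):
    """Spread one SBP step, defined on the levels of a factor, over all cells of the cube."""
    return balance_weights([level_signs[c[axis]] for c in cells])

def hadamard(*vectors):
    """Element-wise product of the given vectors, rescaled to unit norm."""
    prod = [math.prod(vals) for vals in zip(*vectors)]
    norm = math.sqrt(sum(p * p for p in prod))
    return [p / norm for p in prod]

row = generating_vector([1, -1], axis=0)
col = generating_vector([1, -1], axis=1)
s1 = generating_vector([1, -1, -1], axis=2)  # youngest vs. the two older groups
s2 = generating_vector([0, 1, -1], axis=2)   # middle-aged vs. oldest

basis = {
    "r": row, "c": col, "s1": s1, "s2": s2,          # balances (independent part)
    "rc": hadamard(row, col),                        # pairwise interactions
    "rs1": hadamard(row, s1), "rs2": hadamard(row, s2),
    "cs1": hadamard(col, s1), "cs2": hadamard(col, s2),
    "rcs1": hadamard(row, col, s1),                  # three-way interactions
    "rcs2": hadamard(row, col, s2),
}

def coordinates(x_flat):
    """Log-contrasts of the vectorized cube with respect to every basis vector."""
    logs = [math.log(v) for v in x_flat]
    return {name: sum(w * l for w, l in zip(vec, logs)) for name, vec in basis.items()}

# Hypothetical cube values, listed in the order of `cells`
cube_flat = [12.0, 30.0, 9.0, 3.0, 5.0, 4.0, 10.0, 28.0, 11.0, 6.0, 9.0, 5.0]
z = coordinates(cube_flat)
```

The eleven vectors obtained this way are orthonormal and each sums to zero, so the resulting coordinates do not change when the whole cube is multiplied by a constant.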
Interpretation
The interpretation of the coordinates can be discussed on the example of the data from the Czech Republic, see Table 1. The set of row, column and slice balances (as a subvector of ) corresponds to
These numbers are interpretable, as usual for balances, in terms of a dominance of either the group of cells in the numerator (positive value) or denominator (negative value) of the respective log-ratio. Accordingly, in the Czech Republic the proportion of female employees slightly dominates over the proportion of males (), and full-time contracts clearly dominate over part-time contracts (). The slice balances contain information about the age structure of the employees. Due to the high negative value of coordinate it can be concluded that the youngest employees are outbalanced by those from the middle-aged and older groups; within these latter groups, employees aged between 25 and 54 years prevail (coordinate ). More specifically, the ratio between female and male employees is 1.19 (without the normalizing constant and logarithm), full-time contracts prevail over part-time contracts by almost a factor of fifteen, the group of employees aged 25 + is about 4.6 times larger than the youngest group and, finally, there are about twice as many employees aged between 25 and 54 as in the oldest group (55 +). Another possible interpretation is in terms of an average log-ratio between the given groups of employees across all combinations of the remaining factors. E.g., if the coordinate is divided by its normalizing constant , it turns out that the average log-ratio between female and male employees across all combinations of age groups and types of contract is 0.176. An important source of information are the odds ratio coordinates, which for the Czech Republic result in
Coordinate compares the type of contract of male and female employees, and the negative value indicates that the proportion of males with a full-time contract, compared to those employed part-time, is higher than the same proportion of females or, alternatively, that the proportion of females is higher among employees with a part-time contract than among those with a full-time contract. The raw odds ratio between these four groups (formed by the geometric mean across all age groups) equals 0.33 () and the mean log-odds ratio across the age groups is (). Coordinates and compare the age structure of male and female employees and complete the information carried by the balances . The coordinate reveals that the youngest group (15–24) is dominated by older employees; the coordinate adds that this dominance tends to be slightly higher for male employees. On the other hand, the value of the coordinate indicates that the dominance of the age group 25–54 over 55 + tends to be higher for females. The coordinates and can also be interpreted in the sense of odds ratios, by comparing the proportion between full- and part-time contracts in several age groups. The last group of coordinates is formed by and , for the Czech Republic with values 0.124 and , respectively. These coordinates inform about mutual relations between all three factors and their interpretation becomes more involved. Despite this complexity (comparable to the complexity of double interaction terms in regression models), the interpretation in the sense of a double odds ratio is still possible. For instance, it was already derived that females are employed part-time more often than males; due to a positive value of it can be concluded that this relation differs according to the age of employees, specifically it becomes less visible in the youngest group. The proposed system of orthonormal coordinates is appropriate for the further statistical analysis of the relations within each cube.
For a more detailed interpretation, the function cubeCoordWrapper of the R package robCompositions also allows all coordinates to be computed without the normalizing constant, which makes the respective relations easier to quantify.
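The link between a balance, its normalizing constant and the raw ratio quoted in the text can be checked with a few lines of arithmetic. The sketch below assumes the standard normalizing constant of a balance between groups of r and s parts, sqrt(r·s/(r+s)); the variable names are illustrative, and only the values 0.176 and 0.33 are taken from the text.

```python
import math

def balance_constant(r, s):
    """Normalizing constant sqrt(r*s/(r+s)) of a balance between groups of r and s parts."""
    return math.sqrt(r * s / (r + s))

# Average female/male log-ratio reported in the text for the Czech Republic
avg_log_ratio = 0.176

# Back-transformation: ratio of the geometric means of the two groups (about 1.19)
ratio = math.exp(avg_log_ratio)

# In a 2 x 2 x 3 cube each gender covers 6 of the 12 cells, so under the
# assumed constant the corresponding balance is sqrt(3) times the average log-ratio
balance = balance_constant(6, 6) * avg_log_ratio

# The same logic applies to odds ratios: log of the raw odds ratio 0.33 from the text
log_odds_ratio = math.log(0.33)
```

Dividing a coordinate by its normalizing constant and exponentiating thus recovers the ratio of geometric means, which is often easier to communicate than the coordinate itself.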
Statistical analysis
Since a sample of 32 compositional cubes is available, this sample can be investigated in the light of the relative structure. Due to the geographic and economic proximity of some countries, the assumption of independence of the observations does not seem to be sufficiently met in this case. Even though the proposed coordinate system is in general designed to allow for any statistical processing, this prevents the use of standard inference here. First of all, the behavior of the coordinates in the sample can be described using boxplots, see Fig. 4, and bootstrap confidence intervals for the means (both computed by cubeCoordWrapper), which are collected together with the sample mean values and standard deviations in Table 2.
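A bootstrap confidence interval for the mean of one coordinate can be sketched as follows. The type of bootstrap used by cubeCoordWrapper is not specified in the text, so a simple percentile bootstrap is assumed here; it resamples the countries as if they were exchangeable. The data are simulated and are not the OECD values.

```python
import random
import statistics

def bootstrap_mean_ci(values, n_boot=2000, level=0.95, seed=1):
    """Percentile bootstrap confidence interval for the mean (assumed variant)."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and collect the mean of every resample
    means = sorted(statistics.fmean(rng.choices(values, k=n)) for _ in range(n_boot))
    lower = means[int((1 - level) / 2 * n_boot)]
    upper = means[int((1 + level) / 2 * n_boot) - 1]
    return lower, upper

# Simulated values of one coordinate for 32 countries (hypothetical data)
sim = random.Random(0)
coordinate = [sim.gauss(0.17, 0.32) for _ in range(32)]

sample_mean = statistics.fmean(coordinate)
ci_lower, ci_upper = bootstrap_mean_ci(coordinate)

# The effect is called significant in the sense of the text if the interval excludes zero
excludes_zero = ci_lower > 0 or ci_upper < 0
```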
Fig. 4.
Boxplots of the coordinates describing the employment structure
Table 2.
List of sample means, standard deviations, and bootstrap confidence intervals for the mean of the coordinates describing the employment structure
| Coordinate | Mean | SD | CI | Coordinate | Mean | SD | CI |
|---|---|---|---|---|---|---|---|
| | 0.171 | 0.322 | (0.064, 0.271) | | | 0.164 | (−0.235, −0.124) |
| | 3.246 | 1.289 | (2.849, 3.697) | | 0.230 | 0.217 | (0.158, 0.307) |
| | −2.102 | 0.638 | (−2.323, −1.903) | | −0.591 | 0.490 | (−0.752, −0.426) |
| | 1.666 | 0.411 | (1.538, 1.809) | | 0.631 | 0.286 | (0.537, 0.724) |
| | −0.812 | 0.333 | (−0.919, −0.701) | | 0.179 | 0.222 | (0.108, 0.255) |
| | −0.134 | 0.135 | (−0.182, −0.087) | | | | |
The scale of the different coordinates presented as boxplots in Fig. 4 is comparable since it always refers to the log-ratios of the employment data. It is obvious that the countries in the sample differ mainly in the coordinate , which compares the proportions of full-time and part-time contracts. Larger differences are also visible in coordinate , where negative values for all countries are obtained, and thus employees older than 25 years dominate. According to the bootstrap confidence intervals shown in Table 2, none of which contains zero, the effects of the different factors and factor combinations represented by the coordinates are all significant. Thus, not only the previously mentioned simple balances between the factor levels but also interactions between factors strongly influence the overall employment structure.
Besides these general statements, the coordinate representation also allows for a graphical visualization of the regional patterns. For example, the values of the most variable coordinates and are shown in Fig. 5. From the left map it is clearly visible that high values of coordinate , and therefore a strong dominance of full-time contracts, are typical for countries which used to be under the influence of the former Soviet Union. On the other hand, the most negative values of the coordinate , and therefore the strongest prevalence of the older group of employees (25 +), are typical for southern countries like Italy, Greece or Spain.
Fig. 5.
Values of the coordinate representing the log-ratio between full-time and part-time contracts (left) and the coordinate , which represents the log-ratio between the youngest group of employees and the rest (right)
Even though the simple balances and already carry an important piece of information about the employment structure, they suppress the influence of the remaining factors, as described in Sect. 2.3. The possible deviations from the independence between factors are preserved in the coordinates —, and the main sources of variability in this regard can be found e.g. by using principal component analysis (PCA) applied to this set of coordinates. Moreover, since the proposed coordinate system respects the dimensionality of the interactive part of the cube, a robust version of PCA, based on the minimum covariance determinant (MCD) estimates of location and covariance, can also be used. This idea was described in detail for the case of compositional tables in de Sousa et al (2021). Note that even though the data may exhibit spatial dependence, a PCA that does not account for it was used here to reveal the main sources of variability. The geographical aspect will be of interest in the later part of the analysis, which is devoted to clustering. The biplot based on the first two robust principal components is shown in Fig. 6. According to this result, we can say that the main sources of differences between the countries in the sample, in terms of deviations from independence, are the coordinates and . The left map in Fig. 7 visualizes the values of the coordinate . This coordinate is negative for every country in the sample, and therefore the ratio of full-time to part-time contracts (FT/PT ratio) is always higher among male employees than among females. The biggest differences appear in the Netherlands, Belgium, Germany, Austria and Italy; this can be caused e.g. by the popularity of part-time contracts for female employees. On the contrary, in the countries from the eastern part of Europe, where part-time contracts are not very popular in general (see Fig. 5), the difference is less visible. The right map in Fig. 7 shows values of coordinate .
This coordinate is also mostly negative, so the FT/PT ratio is higher for the older group of employees (25 +). Conversely, we can say that part-time contracts are most popular in the age group 15–24. This difference is most visible for Denmark and the other Scandinavian countries, followed by the Netherlands, Spain and Slovenia. In the remaining countries, the effect of age on the FT/PT ratio is rather negligible.
Fig. 6.

Biplot of the first two principal components resulting from the robust PCA, performed on the coordinates of the interactive part of the compositional cube
Fig. 7.
Values of the coordinate , comparing the FT/PT ratio between female and male employees (left), and the coordinate , which compares the same ratio but between the youngest group of employees and the rest (right)
Robust PCA reveals some clusters of countries with similar characteristics in terms of deviations from the independence of the factors. For example, Finland, Sweden, Norway and Denmark have high values on the first component, whereas Romania, Croatia, Bulgaria and Macedonia have high values on the second principal component. A natural question is whether it is possible to obtain geographically compact clusters of countries with similar employment structure, and what the main features defining these clusters are. For this purpose, a clustering method based on two dissimilarity matrices was used (Chavent et al 2018). In this method, the first matrix measures the dissimilarity between the numerical variables, in our case the coordinates –, and the second provides information about neighboring countries. The result is shown in Fig. 8. The first cluster is formed by the Scandinavian countries and Slovenia, with high negative values for and very low (almost zero) values for coordinate . This shows that these countries are characterized by a high popularity of part-time contracts within the youngest group of employees (see the right map in Fig. 7 for comparison), which moreover holds regardless of gender. The second cluster includes countries from Western and Southern Europe, such as Austria, Germany, France, Italy and Spain, with high popularity of part-time contracts for female employees. This property is represented by high negative values of coordinate and is clearly visible in Fig. 7. Finally, the Baltic countries and Russia form another compact cluster. They are characterized by a low difference in the ratio between female and male employees within the middle-aged (25–54) and the oldest (55 +) group ( close to zero) and also by a small difference in the same ratio between employees with full-time and part-time contracts ( close to zero).
Fig. 8.
Result of spatial clustering based on coordinates – representing the interactive part of compositional cubes
Example: mobility data
In the second example, the interest is in the change of the mobility behavior of people in Austria within the time period February 3rd until August 2nd, 2020, thus during the first period of the COVID-19 pandemic. Mobility is measured through the radius of gyration (ROG), a time-weighted distance of the daily movement locations of mobile phones to the main location of the phone owner, see Heiler et al (2020a, 2020b) for details. The phone owners are classified with respect to gender and five age groups (15–29, 30–44, 45–59, 60–74, 75 +). The mobility of each group is represented by the respective median value of ROG. This dataset was already analyzed in Heiler et al (2020a), where the relative differences in the mobility between the age groups were studied through the clr coefficients. This compositional analysis showed an interesting change in the behavior of the youngest (15–29) and oldest (75 +) part of the population during the lockdown period from March 16th to April 6th, 2020 (weeks 12, 13 and 14). Those results are based on separate analyses of males and females, whereas a more complex insight can be reached by a simultaneous study of the age and gender structure. Moreover, when the daily records are aggregated according to the part of the week, the relative differences in mobility over weekdays and weekends can be taken into account. The data at hand can therefore be understood as a time series of 26 compositional cubes, each representing the relative mobility structure within one week from weeks 6–31 of 2020. The row levels are formed by gender (F: female, M: male), the columns by the parts of the week (WD: weekdays, WE: weekend), and the slices represent the different age groups.
Prior to the main part of the analysis, the SBP of the slice factor needs to be defined. With respect to the findings in Heiler et al (2020a), a separation of the economically active (15–59) and non-active (60 +) population seems to be advisable in the first step. The results of the simple clr-based analysis of Heiler et al (2020a) also help to define the remaining steps of the SBP: In the second step, the youngest group (15–29) is separated from those aged between 30 and 59, and the third step focuses on the relative dominance of group 30–44 over 45–59. Finally, the relative dominance of mobility within the age group 75 + over 60–74 is highlighted by the last step of the slice SBP.
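The slice SBP described above can be written as a sign matrix, from which the balance weights follow directly. The sketch below considers only the five age groups (one row and column of the cube), uses the standard balance weights, and evaluates the balances for hypothetical median ROG values; it is not the code used for the analysis.

```python
import math

age_groups = ["15-29", "30-44", "45-59", "60-74", "75+"]

# One row per SBP step: +1 = numerator group, -1 = denominator group, 0 = not involved
sbp = [
    [+1, +1, +1, -1, -1],  # economically active vs. non-active
    [+1, -1, -1, 0, 0],    # youngest vs. 30-59
    [0, +1, -1, 0, 0],     # 30-44 vs. 45-59
    [0, 0, 0, -1, +1],     # 75+ vs. 60-74
]

def balance_weights(signs):
    """Unit-norm, zero-sum log-contrast weights for one SBP step."""
    r, s = signs.count(1), signs.count(-1)
    pos = math.sqrt(s / (r * (r + s)))
    neg = -math.sqrt(r / (s * (r + s)))
    return [pos if v == 1 else neg if v == -1 else 0.0 for v in signs]

contrast = [balance_weights(step) for step in sbp]

def balances(values):
    """Balances of a positive vector with respect to the SBP above."""
    logs = [math.log(v) for v in values]
    return [sum(w * l for w, l in zip(weights, logs)) for weights in contrast]

# Hypothetical median ROG values for the five age groups (arbitrary units)
rog = [5.2, 6.1, 5.8, 3.9, 2.4]
z_slice = balances(rog)
```

Five groups require four steps, and the rows of the resulting contrast matrix are orthonormal, which is what makes the balances a valid coordinate system for the age structure.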
Based on this coordinate representation, some interesting patterns and their sources can be revealed by PCA. As the main aim of the analysis is to illustrate the principles of working with multi-factorial compositional data, the classical approach was applied here. Note, however, that considering the time-dependent nature of the data would lead to more accurate results. Figure 9 shows biplots based on the first four principal components, which together describe the vast majority of the whole variability.
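How a principal component ends up being driven by a single coordinate can be seen in a small classical (non-robust) PCA. The sketch below extracts only the leading component by power iteration on the sample covariance matrix; the weekly coordinate values are hypothetical and chosen so that the first coordinate varies most.

```python
import math

def covariance(rows):
    """Sample covariance matrix of observations stored row-wise."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centred = [[r[j] - means[j] for j in range(p)] for r in rows]
    return [[sum(c[a] * c[b] for c in centred) / (n - 1) for b in range(p)]
            for a in range(p)]

def leading_component(cov, n_iter=500):
    """Leading eigenvector (loadings) and eigenvalue of `cov` via power iteration."""
    p = len(cov)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(t * t for t in w))
        v = [t / norm for t in w]
    eigenvalue = sum(v[a] * sum(cov[a][b] * v[b] for b in range(p)) for a in range(p))
    return v, eigenvalue

# Hypothetical values of three coordinates observed in six weeks
weeks = [
    [0.50, 0.10, -0.20],
    [0.90, 0.12, -0.18],
    [1.40, 0.08, -0.22],
    [1.10, 0.11, -0.21],
    [0.60, 0.09, -0.19],
    [0.70, 0.10, -0.20],
]

cov = covariance(weeks)
loadings, eigenvalue = leading_component(cov)
explained = eigenvalue / sum(cov[j][j] for j in range(len(cov)))  # share of total variance
```

Here the first loading is close to one in absolute value, so the first component essentially reproduces the most variable coordinate.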
Fig. 9.
Biplots based on the first four principal components computed for the mobility data. The numbers represent the week numbers in 2020
In order to show the whole structure of the PCA results, the loadings of all PCs are collected in Fig. 10. However, the vast majority of the variability () is explained by the first four principal components, whose values are always mainly driven by a single coordinate:
PC1—, log-ratio between weekdays and weekend mobility, aggregated over all gender and age groups,
PC2—, log-ratio between mobility of economically active and non-active population, aggregated over all other categories,
PC3—, log-ratio between mobility of age groups 15–29 and 30–59, aggregated over all other categories,
PC4—, log-odds-ratio comparing economically active and non-active mobility ratio over weekdays and weekends, aggregated over gender.
Fig. 10.
Loadings of the principal components computed for the mobility data
The main source of variability is given by the ratio between weekdays and weekend mobility (). As is visible in Fig. 11, this log-ratio varied over the whole studied period, but it was also atypically high during weeks 11 and 12 preceding the lockdown period. This provides evidence of a decrease in weekend mobility. The second principal component helps to detect typical characteristics for the weeks during and immediately after lockdown (weeks 12–17). While the relative mobility of the economically active population, quantified by , was among the lowest during lockdown, the mobility of the oldest group 75 + was nearly comparable to the mobility of the population aged between 60 and 74 (coordinate ). Moreover, based on coordinate , Fig. 12 shows a remarkable difference in the economically active and non-active population mobility ratio during weekdays and weekends, where the former tends to be higher during lockdown. The weeks from the end of the observed period (weeks 24–31) are clearly separated by the third principal component. The relative mobility of the youngest group 15–29 is among the lowest in comparison to the mobility of group 30–59 (). Moreover, quite stable and high values of coordinate , comparing the ratio between the economically active and non-active population for females and males, are typical for this period. Finally, PC4 characterizes the behavior during the weeks 16–22 immediately following the lockdown, with the only exception of week 20. According to the respective loadings, the separation of weeks 16–22 is not only driven by relatively low values of , but also by a mixture of other phenomena. For these weeks we observe a high relative dominance of mobility of group 15–29 over the mobility of those aged between 30 and 59 (), and a high relative dominance of the mobility of group 30–44 over group 45–59 ().
Fig. 11.
Selected balances (without the normalizing constant) computed for the mobility data and their development over time. The red dashed lines define the first lockdown period in Austria
Fig. 12.
Selected odds-ratio coordinates (without the normalizing constant) computed for the mobility data and their development over time. The red dashed lines define the first lockdown period in Austria
Conclusions
It has been demonstrated that the concepts developed for two-factorial compositional data can be extended to compositional cubes, and even to the general k-factorial case. The fundamental idea is to investigate the relative data structure in terms of log-ratios between different factors and factor levels. One advantage of such an approach is scale invariance, which is particularly useful when the reported values of the observations are not comparable (in our example caused by different population sizes of the countries) or if the relative structure is of main interest.
It has been shown that each compositional cube can be decomposed into its independent and interactive parts. Furthermore, the interactive part can be decomposed into cubes representing the pairwise factor interactions and the interaction between all three factors. It turned out that the components of the interactive part have an advantageous property of uniform marginals and, moreover, that the principle of the decomposition can be directly extended to the general case of k-factorial compositions. Since the commonly used systems of orthonormal coordinates are not able to sufficiently describe the multi-factorial nature of cubes and respect the possibility of their decomposition, an alternative system has been proposed. Moreover, this system can be constructed in a flexible manner, basically according to the needs or expert knowledge of the analyst: There might be a certain hypothesis on the relations between factors and/or factor levels, and based on the principle of sequential binary partitions (SBPs), these combinations can be reflected by the constructed coordinates. The proposed coordinate representation allows the overall relations between factors to be described. In addition, similarly to the well-developed theory for the analysis of vector compositional data, the coordinates can be used for further analysis with standard statistical methods, or for statistical inference, for example by constructing bootstrap confidence intervals for the mean in order to determine whether the effect conveyed by a coordinate is significant. A proper coordinate representation of the multi-factorial compositional data can therefore be understood as a first step in the analysis, possibly followed by other advanced statistical methods.
For example, regression methods with compositional regressors with or without the total (Coenders et al 2017), or any other proper methods of one-factorial (vector) compositional data analysis (Pawlowsky-Glahn et al 2015) can be used (after their possible adaptation to the more complex structure of coordinates).
The idea of modeling interactions between factors using the normalized Hadamard product of vectors, derived from SBPs at the single factor level, works equivalently for a higher number of factors. Therefore, the approach presented here for compositional cubes can be extended in a straightforward manner to higher-order arrays.
Acknowledgements
The work was supported by the Czech Science Foundation, Project 22-15684L, and by the Austrian Science Foundation, Project I 5799-N.
Appendix
The SBPs defined in Sect. 3.1 lead to the following contrast matrix,
The rows of the matrix correspond to coordinates specified in Sect. 3.1, and they are ordered in the following way,
The columns correspond to parts of . Vector , the vectorized form of the compositional cube, consists of parts
An alternative partition can be defined for the slice factor—the age of the employees. If the relative dominance of the oldest group over the remaining two is of interest, one can separate this group from the rest first, when defining the slice SBP. The second step can then separate the middle-aged group (25–54 years) from the youngest, which leads to the coordinate quantifying the relative dominance of employees aged between 25 and 54 years over those younger than 25. This strategy would alter the structure of the generating vectors, which now result in
and
The generating vectors which are not related to the values of the slice factor remain the same as in Sect. 3.1. More specifically,
All the remaining vectors change as follows:
The new system of coordinates can be obtained through the Eq. (4), with the contrast matrix
or, alternatively, as a rotation of vector . According to (30) the rotation matrix equals
In both situations, the coordinate system , highlighting the relative dominance of the oldest group of employees, is formed by the following eleven coordinates:
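The equivalence of the two routes (new contrast matrix versus rotation of the original coordinates) can be illustrated on the slice factor alone. The sketch below treats the three age groups as a plain three-part composition and assumes that the rotation matrix is the product of the new contrast matrix with the transpose of the old one, which is the standard relation between two orthonormal bases; the shares are hypothetical.

```python
import math

def balance_weights(signs):
    """Unit-norm, zero-sum log-contrast weights for one SBP step."""
    r, s = signs.count(1), signs.count(-1)
    pos = math.sqrt(s / (r * (r + s)))
    neg = -math.sqrt(r / (s * (r + s)))
    return [pos if v == 1 else neg if v == -1 else 0.0 for v in signs]

def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

# Parts are ordered as (15-24, 25-54, 55+)
old_basis = [balance_weights([1, -1, -1]),   # youngest vs. rest
             balance_weights([0, 1, -1])]    # middle-aged vs. oldest
new_basis = [balance_weights([-1, -1, 1]),   # oldest vs. rest
             balance_weights([-1, 1, 0])]    # middle-aged vs. youngest

# Rotation matrix between the two coordinate systems (assumed form: V_new V_old^T)
rotation = [[dot(n, o) for o in old_basis] for n in new_basis]

shares = [0.08, 0.74, 0.18]  # hypothetical age structure
logs = [math.log(s) for s in shares]

z_old = [dot(w, logs) for w in old_basis]          # coordinates under the original SBP
z_new = [dot(w, logs) for w in new_basis]          # coordinates under the alternative SBP
z_rotated = [dot(row, z_old) for row in rotation]  # same result via rotation
```

Both routes give the same coordinates, and the rotation matrix is orthogonal, so distances between observations are not affected by the choice of the SBP.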
Author Contributions
All authors contributed to the final version of the manuscript. Kamila Fačevicová developed the main part of the presented theory and prepared the first example. Karel Hron and Peter Filzmoser contributed to the theoretical developments and prepared the second illustrative example. The first draft of the manuscript was written by Kamila Fačevicová and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.
Funding Information
The research of KF and KH leading to these results was supported by the Czech Science Foundation, project 22-15684L and work of PF was supported by the Austrian Science Foundation, project I 5799-N.
Data Availability
The data analyzed within the first example are freely available at the OECD data repository http://stats.oecd.org. The mobility data analyzed in the second example are not publicly available because of data ownership reasons.
Footnotes
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Contributor Information
Kamila Fačevicová, Email: kamila.facevicova@gmail.com.
Peter Filzmoser, Email: Peter.Filzmoser@tuwien.ac.at.
Karel Hron, Email: karel.hron@upol.cz.
References
- Agresti A. Categorical data analysis. 2nd ed. New York: Wiley; 2002.
- Aitchison J. The statistical analysis of compositional data (with discussion). J R Stat Soc Ser B (Stat Methodol). 1982;44(2):139–177.
- Aitchison J. The statistical analysis of compositional data. London: Chapman and Hall; 1986.
- Billheimer D, Guttorp P, Fagan WF. Statistical interpretation of species composition. J Am Stat Assoc. 2001;96(456):1205–1214. doi: 10.1198/016214501753381850.
- Chavent M, Kuentz-Simonet V, Labenne A, et al. ClustGeo: an R package for hierarchical clustering with spatial constraints. Comput Stat. 2018;33:1799–1822. doi: 10.1007/s00180-018-0791-1.
- Coenders G, Martín-Fernández JA, Ferrer-Rosell B. When relative and absolute information matter: compositional predictor with a total in generalized linear models. Stat Model. 2017;17(6):494–512. doi: 10.1177/1471082X17710398.
- de Sousa J, Hron K, Fačevicová K, et al. Robust principal component analysis for compositional tables. J Appl Stat. 2021;48(2):214–233. doi: 10.1080/02664763.2020.1722078.
- Eaton ML. Multivariate statistics. A vector space approach. New York: Wiley; 1983.
- Egozcue JJ, Pawlowsky-Glahn V. Groups of parts and their balances in compositional data analysis. Math Geol. 2005;37:795–828. doi: 10.1007/s11004-005-7381-9.
- Egozcue JJ, Pawlowsky-Glahn V, Mateu-Figueras G, et al. Isometric logratio transformations for compositional data analysis. Math Geol. 2003;35(3):279–300. doi: 10.1023/A:1023818214614.
- Egozcue JJ, Díaz-Barrero JL, Pawlowsky-Glahn V. Compositional analysis of bivariate discrete probabilities. In: Daunis-i Estadela J, Martín-Fernández JA, editors. Proceedings of CODAWORK’08, The 3rd compositional data analysis workshop. University of Girona, Spain; 2008.
- Egozcue JJ, Pawlowsky-Glahn V, Templ M, et al. Independence in contingency tables using simplicial geometry. Commun Stat. 2015;44(18):3978–3996. doi: 10.1080/03610926.2013.824980.
- Fačevicová K, Hron K, Todorov V, et al. Logratio approach to statistical analysis of compositional tables. J Appl Stat. 2014;41(5):944–958. doi: 10.1080/02664763.2013.856871.
- Fačevicová K, Hron K, Todorov V, et al. Compositional tables analysis in coordinates. Scand J Stat. 2016;43:962–977. doi: 10.1111/sjos.12223.
- Fačevicová K, Hron K, Todorov V, et al. General approach to coordinate representation of compositional tables. Scand J Stat. 2018;45(4):879–899. doi: 10.1111/sjos.12326.
- Fačevicová K, Kynčlová P, Macků K. Geographically weighted regression analysis for two-factorial compositional data. In: Filzmoser P, Hron K, Martín-Fernández JA, Palarea-Albaladejo J, editors. Advances in compositional data analysis. Cham: Springer; 2021. pp. 103–124.
- Filzmoser P, Hron K, Templ M. Applied compositional data analysis. Cham: Springer; 2018.
- Heiler G, Hanbury A, Filzmoser P. The impact of covid-19 on relative changes in aggregated mobility using mobile-phone data. arXiv: 2009.03798; 2020a.
- Heiler G, Reisch T, Hurt J, et al. Country-wide mobility changes observed using mobile phone data during covid-19 pandemic. In: 2020 IEEE international conference on big data (big data). Los Alamitos: IEEE Computer Society; 2020b. pp. 3123–3132. doi: 10.1109/BigData50022.2020.9378374.
- Pawlowsky-Glahn V, Egozcue JJ. Geometric approach to statistical analysis on the simplex. Stochast Environ Res Risk Assess (SERRA). 2001;15(5):384–398. doi: 10.1007/s004770100077.
- Pawlowsky-Glahn V, Egozcue JJ, Tolosana-Delgado R. Modeling and analysis of compositional data. Chichester: Wiley; 2015.
- Templ M, Hron K, Filzmoser P. robCompositions: an R-package for robust statistical analysis of compositional data. In: Pawlowsky-Glahn V, Buccianti A, editors. Compositional data analysis: theory and applications. Chichester: Wiley; 2011. pp. 341–355.