Graphical abstract
Method name: Data-driven cluster-similarity analysis
Keywords: Cluster analysis, Similarity measure, Asset allocation, Risk analysis, Investment decisions
Abstract
Aiming to support investment portfolio diversification through a data-driven approach, the present methodological paper proposes a new cluster analysis that compares publicly traded companies, mainly in times of high volatility (e.g. crisis times). The main goal of the proposed method is to provide a less arbitrary analysis that helps financial investors measure precisely the degree of similarity between equity stocks, unveiling equity market clustering patterns by applying analytic geometry solutions and calculating an overall clustering pattern indicator. Empirical results on synthetic data demonstrate both the conceptual superiority of the proposed method over traditional cluster analyses and its potential practical usefulness for asset allocation, portfolio strategy, and asset pricing, among other related purposes. Finally, the outputs of the proposed cluster analysis are presented through an intuitive and easily understandable mathematical visualization.
• A new method is proposed to calculate risk-similarity and clustering patterns.
• The method unveils clustering patterns through a data-driven process.
• Portfolio diversification can benefit from sphere-sphere intersection calculations.
Specifications Table
| Subject Area: | Social Sciences |
| More specific subject area: | Econophysics |
| Method name: | Data-driven Cluster-similarity Analysis |
| Name and reference of original method: | Main original method names: k-means Clustering and Hierarchical Cluster Analysis (HCA). Main original method reference: Jain, A.K., 2010. Data clustering: 50 years beyond K-means. Pattern Recognition Letters, 31(8), pp. 651–666. |
| Resource availability: | N/A |
Method details
Introduction
The greater frequency and impact of financial and economic crises require the development of feasible methods that clearly distinguish investment alternatives in a robust, consistent, and coherent manner [1–4]. The main goal of the proposed data-driven cluster-similarity risk method is to help financial investors measure precisely the level of similarity between publicly listed companies through time, especially in turbulent periods, and to show the outputs through a clear graphical representation as a by-product of the analysis. As depicted in the Graphical Abstract, the method is performed through three successive steps, in which each asset is represented by a corresponding 2-sphere (i.e. ordinary three-dimensional sphere) in the three-dimensional Euclidean space ($\mathbb{R}^3$).
In step one, the trajectory of each equity stock through time is analysed and the variation of the individual risk factor – termed as – is calculated by finding the radius of the corresponding 2-sphere (simply sphere onwards). Subsequently, in step two the spatial approximation between assets/spheres is visualized and the intersection volume – termed as – between every pair of spheres in the sample is calculated using analytic geometry. Finally, in step three both the individual spherical volumes and the intersection volumes calculated in the previous steps are used as input values to an overall clustering pattern indicator bounded from zero to one, which is a proxy measure designed to meaningfully assess the level of shared risk between all stocks in a sample based on their level of similarity at any particular date as well as through time .
Most of the labour involved in the proposed method occurs in the first two steps, which consist of geometrical calculations involving spheres and intersection volumes in the three-dimensional Euclidean space. However, this labour is well compensated by the precise outputs generated, which yield consistent estimates as well as a clear mathematical visualization of the calculations and analyses performed. The motivation for proposing this non-hierarchical clustering method is to tackle some relevant problems and challenges often reported in the cluster analysis literature, which traditional clustering methods – such as k-means clustering and hierarchical cluster analysis (HCA) – do not address properly [5,6], such as: (i) setting the number of clusters in a non-arbitrary manner (i.e. through a data-driven process); (ii) the possibility of a dataset having zero clusters; and (iii) allocating outliers to no cluster.
Although the proposed method was primarily designed to work with quantitative continuous variables as input data, it is possible to use virtually any type of data (e.g. nominal, ordinal, binary) to be placed into any of the three axes in . An important characteristic of using such non-continuous variables is that the spatial distance between each value in the axis must be equally spaced. Therefore, a relevant caveat of using such variables is that this equal spatial distance between each value needs to be set arbitrarily by the analyst. This arbitrary setting can potentially distort the analysis performed, by allowing the analyst to manipulate the outputs – for instance, by setting a very large spatial distance between each value in the axis in order to decrease the number of spherical volume intersections and, consequently, artificially reducing the number of clusters in a particular sample, which is extremely undesirable. This problem is avoided once input data are based on continuous variables since the spatial distance between each value in the axis would be determined solely by the data itself (i.e. data-driven), following a predetermined standard scale (e.g. company revenue, inflation rate, exchange rate).
Following this introduction, the paper is divided into five sections and it proceeds as follows. Sections Asset Individual Trajectory (Step 1), Spatial Approximation between Assets (Step 2) and Clustering Pattern Indicator (Step 3) explore and detail the first, second, and third steps of the proposed method, respectively. Section Experimental Outputs and Results provides two case studies and a brief discussion over the results. Finally, the last section concludes and suggests extensions for future research.
Step 1: asset individual trajectory
The proposed analysis starts by verifying the spatial trajectory of each stock (based on their respective axial variables) as well as the contemporaneous progress of their respective individual risk factor (based on their respective spherical volume). There are four values to be used as data input in the proposed analysis: the first, second, and third values are reflected on the -axis, -axis, and -axis, respectively; and the fourth value is reflected in the volume of each sphere (i.e. or ).
Each of those four values (i.e. , , , and ) is based on the rate of variation (from t − 1 to ) of four real-world variables (i.e. , , , and ), respectively. The rate of variation of each of the four real-world variables to be placed into each of the three axes as well as reflected in the sphere volume are calculated as follows [7]:
| (1) |
Where and , which refer to the number of subjects and time span of the sample, respectively. Throughout this paper and in distinct contexts, the subjects and are frequently used to exemplify two generic sample subjects. Worth noting that the fourth variable is in absolute value, which reflects the volume of the respective sphere in . Therefore, the time series placed into the three axes as well as the one used as the spherical volume are sequenced as follows:
Each sphere represents a different stock within a particular industry in the three-dimensional Euclidean space. For instance, in the automotive industry it would be reasonable to compare stock indicators (i.e. data input) from companies such as Fiat, Ford, General Motors, Toyota, and Volkswagen; in the technology industry, one would analyse data from companies such as Alphabet/Google, Amazon, Apple, Facebook, and Microsoft; among many more possibilities.
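The rate-of-variation inputs described in this step can be illustrated with a minimal sketch. It assumes a simple period-on-period relative change, (value at t minus value at t − 1) divided by the value at t − 1, which may differ from the exact form of Eq. (1); the function name `rate_of_variation` and the sample series are hypothetical and chosen only for illustration.

```python
def rate_of_variation(series, absolute=False):
    """Simple relative change between consecutive observations.

    Assumption: (x_t - x_{t-1}) / x_{t-1}; the paper's Eq. (1) may differ.
    Set absolute=True for the fourth variable, which is taken in absolute
    value before being used as a spherical volume.
    """
    rates = []
    for previous, current in zip(series, series[1:]):
        if previous == 0:
            raise ValueError("previous value must be non-zero")
        rate = (current - previous) / previous
        rates.append(abs(rate) if absolute else rate)
    return rates


# Hypothetical raw observations for one stock over three dates.
roi_levels = [0.100, 0.114, 0.108]
price_levels = [50.0, 54.0, 51.3]

axis_values = rate_of_variation(roi_levels)                    # one axial input
volume_values = rate_of_variation(price_levels, absolute=True)  # sphere volumes
```

Each raw series of length T yields T − 1 rates, so the first date of a sample has no rate of variation unless an earlier observation is available.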
The volume of each sphere is analytically calculated and graphically depicted in $\mathbb{R}^3$. The spherical volume aims to reflect a variable that is recognised as the best proxy available to measure the level of risk of a particular data set. This variable should be relevant and meaningful from the point of view of the investor/analyst who is performing the analysis. Moreover, this risk proxy should account for industry context and idiosyncrasies (e.g. levels of expected stock price return, equity-to-debt ratio, market capitalization, or any other well-justified indicator). As illustrated in Fig. 1, this risk proxy varies dynamically across time as well as across the stocks in the sample. Earnings per share (EPS), price to earnings (P/E) ratio, price to book value (P/B) ratio, and dividend yield are just a few examples, among a wide range of possibilities, of real-world economic variables and/or financial indicators that can be placed on any of the three axes in $\mathbb{R}^3$ (Fig. 1).
Fig. 1.
Visualisation of spatial trajectories of stocks and in from (left) to (right).
As with any quantitative analysis, it is of utmost importance that the analyst selects coherently and justifies properly each variable used as input data for the proposed method, especially the fourth and most important variable, which is reflected in the volume of the respective sphere (i.e. ). How to choose the most appropriate variables is outside the scope of this paper; the choice can be based on a series of factors, such as individual investment preferences, risk tolerance, and market consensus, among many others.
The analysis is performed in the three-dimensional space because, compared to the two-dimensional space $\mathbb{R}^2$, one additional variable is included. This additional variable enriches the analysis by allowing more information to be added as input data. Moreover, although an extension to an $n$-dimensional Euclidean space is feasible, it is not proposed here because of a trade-off: on one hand, the analyst could use $n$ variables instead of three, which is desirable; on the other hand, in $\mathbb{R}^n$ there would be no consistent graphical visualisation of the outputs.
Asset individual risk factor:
As depicted in Fig. 2, each equity stock is represented by its respective sphere in and their individual volume is given by a real-world economic or financial variable, termed as , which is calculated as follows:
| (2) |
Where is the time-varying radius of the sphere, $\pi$ is the well-known ratio of the sphere’s circumference to its diameter, and $\Gamma$ is the gamma function; all variables refer to stock at time . The formulas in Eq. (2) are the classical formulas for the volume of a sphere in a three-dimensional Euclidean space ($\mathbb{R}^3$), adapted in the present paper to incorporate the temporal variation reflected in the volume as well as the time-varying radius of the respective sphere. In addition, to check the accuracy of Eq. (2) against the particular real-world data used as input in the proposed analysis, one can examine the value of the transcendental number $\pi$, which must represent the same spherical proportion for any subject and on every date, as per below:
| (3) |
Where is the circumference and is the diameter of the sphere, both representing stock at time .
Fig. 2.
Equity stock represented as a sphere in at time , with centre at , radius , and volume .
Time-varying radius:
As and are constants (and, therefore, not a function of time) and as is known in advance (because it refers to a given real-world data value used as the main risk proxy in the proposed analysis), in fact the only unknown variable in Eq. (2) is the time-varying radius , which can be found as follows:
| (4) |
Replacing the sphere volume by from Eq. (4) and solving for gives:
| (5) |
Thus, as time passes and new events occur (e.g. news in the media, communications made by governments, entry of new competitors, CEO retirement), stocks (represented by spheres) follow a spatial trajectory, experience spherical volume variation through time, and eventually interact with one another, reflecting real-world interactions in $\mathbb{R}^3$. For instance, if a company has one of its indicators (e.g. ROI, reflected in the -axis) considered as the industry benchmark (e.g. Apple within the technology industry, Exxon Mobil in the oil and gas sector), then its competitors tend to adopt similar management decisions and take action in an attempt to close the gap between their own indicator (in this example, ROI) and the industry benchmark. As a consequence of that competitive move, the indicators of the stocks in the sample tend to become more similar through time and, therefore, closer to one another in $\mathbb{R}^3$.
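The relation between the spherical volume and the time-varying radius described in this step can be sketched as follows. The sketch uses only the classical sphere-volume formula and its gamma-function form for a three-dimensional ball; the function names are hypothetical, and the input volume 0.08 is the first risk-proxy value listed in Table 1.

```python
import math


def sphere_volume(radius):
    """Classical volume of a sphere of the given radius: (4/3) * pi * r^3."""
    return 4.0 / 3.0 * math.pi * radius ** 3


def sphere_volume_gamma(radius):
    """Same volume via the n-ball formula with n = 3:
    pi^(3/2) / Gamma(3/2 + 1) * r^3."""
    return math.pi ** 1.5 / math.gamma(2.5) * radius ** 3


def radius_from_volume(volume):
    """Invert the volume formula: r = (3V / (4*pi))^(1/3)."""
    if volume < 0:
        raise ValueError("volume must be non-negative")
    return (3.0 * volume / (4.0 * math.pi)) ** (1.0 / 3.0)


# The risk proxy (a given data value) fixes the volume; the radius follows.
risk_proxy = 0.08
radius = radius_from_volume(risk_proxy)

# Round trip: both volume formulas recover the input from the radius.
assert math.isclose(sphere_volume(radius), risk_proxy)
assert math.isclose(sphere_volume_gamma(radius), risk_proxy)
```

Because the radius grows with the cube root of the volume, a doubling of the risk proxy enlarges the radius by only about 26%.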
Step 2: spatial approximation between assets
After calculating all asset individual risk factors for each stock in the sample in step 1, it is possible to calculate, and to confirm through visual inspection in $\mathbb{R}^3$, whether there is no risk volume sharing between stocks (i.e. the case in which there are only trivial intersections between pairs of spheres) or whether risk volume sharing between one or more pairs of spheres has effectively occurred (i.e. the case in which there is at least one non-trivial intersection between spheres), as depicted in Figs. 3 and 4.
Fig. 3.
Non-risk volume sharing situation between stocks and in at time , represented by a trivial intersection in which .
Fig. 4.
Risk volume sharing situation between stocks and in at time , represented by a non-trivial intersection in which .
The measurement of the individual risk factor (i.e. risk proxy of stock or its spherical volume regardless of any intersection at a particular date ) is found in step 1 by calculating the individual volume of the sphere that represents each stock at each particular point in time . If there is an intersection between stocks and in then it is possible to calculate the respective shared value between those two stocks. This shared volume (i.e. non-trivial intersection) occurs due to (i) greater similarity in the spatial trajectory and proximity of the indicators between stocks; (ii) increase in the risk perception proxy measure of each stock, reflected in a greater sphere volume; or (iii) the combined performance of items (i) and (ii).
Asset common risk factor:
The shared volume between stocks and is termed as the common risk factor due to the fact that this value is attributed to individual risk factors and that are shared between the pair of stocks and , as follows:
| (6) |
Analytic geometry is the natural way to calculate the precise volume of two spheres intersecting in , as graphically depicted in Fig. 5. There is a non-trivial volume intersection between two spheres and if and only if:
| (7) |
Where and are the centres of spheres and , respectively, and refers to the Euclidean norm.
Definition 1
The intersection of two spheres is the circumference of a circle whose plane is perpendicular to the line joining the centres of the surfaces and whose centre is in that line [8].
Fig. 5.
Non-trivial intersection between spheres (left) and (right) at time , resulting in the .
The spheres and , of radii and are placed at time and centred at as well as in , respectively. The calculation of the sphere-sphere volume intersection is similar to the circle-circle area intersection, which is in accordance with Definition 1 and which plays an important role in the sphere-sphere volume intersection calculation. The steps of the analytical calculation of a non-trivial intersection between the pair composed of spheres and are detailed below [9–12].
The equation of sphere is defined in Eq. (8) and the equation of sphere is defined in Eq. (9), as follows:
| (8) |
| (9) |
Combining Eq. (8) and Eq. (9) gives:
| (10) |
Multiplying through and rearranging Eq. (10) result in the following:
| (11) |
Solving Eq. (11) for gives:
| (12) |
The intersection between both spheres is a curve lying in a plane parallel to the , -plane at a single -coordinate; substituting this coordinate back into Eq. (8) yields the following:
| (13) |
| (14) |
| (15) |
Which refers to a circle with the following radius :
| (16) |
| (17) |
Thus, the volume of the three-dimensional common lens of spheres and can be found by adding the respective spherical caps. The distances from the centres of spheres and to the bases of the respective caps are given by:
| (18) |
| (19) |
The heights of each of the spherical caps are then calculated as follows:
| (20) |
| (21) |
The volume of a spherical cap of height for a sphere of radius is:
| (22) |
Therefore, in order to specifically find the , it is necessary to sum both spherical caps, as follows:
| (23) |
| (24) |
Where and refer to the volume of the spherical caps of spheres and , respectively. In the case that , the expression above gives as one would expect.
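The spherical-cap construction described above can be sketched in code. This is a minimal illustration using the standard lens-volume result (the distance from each centre to the intersection plane, the cap heights, and the cap-volume formula π h² (3r − h) / 3); the function names are hypothetical, and the handling of one sphere lying entirely inside the other is an added assumption not discussed in the text.

```python
import math


def cap_volume(radius, height):
    """Volume of a spherical cap of the given height cut from a sphere."""
    return math.pi * height ** 2 * (3.0 * radius - height) / 3.0


def intersection_volume(centre_i, radius_i, centre_j, radius_j):
    """Volume shared by two spheres in three-dimensional space."""
    d = math.dist(centre_i, centre_j)  # Euclidean distance between centres

    # Trivial intersection: spheres are apart or touch at a single point.
    if d >= radius_i + radius_j:
        return 0.0

    # Assumption: if one sphere contains the other, the shared volume
    # is the whole volume of the smaller sphere.
    if d <= abs(radius_i - radius_j):
        smaller = min(radius_i, radius_j)
        return 4.0 / 3.0 * math.pi * smaller ** 3

    # Distances from each centre to the plane of the intersection circle.
    d_i = (d ** 2 + radius_i ** 2 - radius_j ** 2) / (2.0 * d)
    d_j = d - d_i

    # Cap heights, then the lens volume as the sum of both caps.
    h_i = radius_i - d_i
    h_j = radius_j - d_j
    return cap_volume(radius_i, h_i) + cap_volume(radius_j, h_j)


# Two unit spheres whose centres are one unit apart.
shared = intersection_volume((0.0, 0.0, 0.0), 1.0, (1.0, 0.0, 0.0), 1.0)
assert math.isclose(shared, 5.0 * math.pi / 12.0)
```

For equal radii the plane of intersection sits midway between the centres, so both caps are identical; for unequal radii the plane shifts towards the centre of the smaller sphere.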
This section considers only the intersection between the two spheres and in at a particular point in time . However, depending on the number of subjects included in the sample, it would be necessary to consider more than two spheres, performing intersection calculation between spheres (representing the stocks in the respective sample) at time . For samples with , computational geometry techniques would be a suitable alternative to reach reasonably approximate spherical volume intersection results through numerical analysis.3
Asset idiosyncratic risk through symmetric difference:
Through symmetric difference one can easily find the idiosyncratic risk measure of stock , termed as , which refers to a risk value attributed only to individual factors of each stock at a particular time . The calculation of this type of risk is performed to provide the remaining volume of the sphere that has no intersection with any other stock in the sample (i.e. total volume of the sphere representing the risk information carried by stock subtracted by the intersecting volume with other stock’s spheres), as follows:
| (25) |
The symmetric difference of and is associative as well as commutative, and can be alternatively denoted using the following notation [13,14]:
| (26) |
The concept of can also be interpreted as follows: although two publicly traded companies are experiencing problems at the same time t, by analysing their respective stock indicators used as input data in the proposed method, those two companies are potentially experiencing distinct problems and that is the reason that stocks and have very low or zero volume risk sharing at a particular point in time. Fig. 6 depicts the concept of the idiosyncratic risk .
Fig. 6.
Idiosyncratic risks and of stocks (left) and (right), respectively.
Finally, in the case that , it can be interpreted that the overall risk of a particular stock is the result of individual (i.e. idiosyncratic) problems that are distinct, or even very distinct, from the problems being experienced by stock and, therefore, both stocks have zero volume risk sharing. It is worth noting that care must be taken when reading results based on intersecting spherical volumes and when translating them into meaningful similarity and risk measures that may support the decision-making process.
Step 3: clustering pattern indicator
In the last step, the values found in the two previous steps are used as input to an overall clustering pattern indicator, termed as . This indicator aims to measure how similar (or dissimilar) the sample subjects (e.g. stocks) are at each point in time.
Time series cluster ratio of over and :
After having found the individual risk factors and as well as the common risk factor between every pair of distinct spheres in the sample – except for intersections between the sphere with itself, such as , which obviously yields – the respective intersecting volumes are placed in the numerator and the sum of individual volumes of all spheres are placed in the denominator of the following time series cluster ratio indicator:
| (27) |
Which can alternatively be written as:
| (28) |
Where
In the case of the series , the lower limit (i.e. zero) refers not only to the usual concept of an almost infinitesimal value, but also to an actual numerical possibility of being zero itself. On the other hand, the upper limit (i.e. +∞) obviously does not refer to an actual value, but instead it is a mathematical concept that conveys the idea of an extremely large, however unreachable, number which in this case would reflect a very large spherical volume.
In the case in which there is a trivial intersection (i.e. ) the numerator would be zero, which would result in . Conversely, in the opposite extreme case in which all spheres in the sample are placed at the exact same three coordinates and have precisely the same volume, then . Thus, on one hand, the closer is to one, the more similar and grouped the stocks are in a given sample. On the other hand, the closer gets to zero, the more dissimilar and separated the stocks are from each other. In analyses performed using real-world variables as data input, most results are expected to lie between those two extreme values that form the closed interval ; reaching the precise upper bound of one, although not impossible, would be extremely rare.
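The limiting behaviour described above can be illustrated for a sample of two stocks. The sketch below is only one normalisation that matches the stated properties (intersecting volume in the numerator, sum of individual volumes in the denominator, zero for a trivial intersection, one for identical spheres); it assumes the shared volume is counted once for each stock of the pair, which may not match the exact form of Eq. (27). The function name `two_stock_cluster_ratio` is hypothetical.

```python
def two_stock_cluster_ratio(volume_i, volume_j, shared_volume):
    """Illustrative clustering ratio for a two-stock sample.

    Assumption: ratio = 2 * shared / (V_i + V_j), i.e. the shared volume
    is counted once per stock of the pair. Not necessarily Eq. (27).
    """
    if min(volume_i, volume_j, shared_volume) < 0:
        raise ValueError("volumes must be non-negative")
    if shared_volume > min(volume_i, volume_j):
        raise ValueError("shared volume cannot exceed the smaller sphere")
    total = volume_i + volume_j
    if total == 0:
        return 0.0  # degenerate sample: no volume, hence no shared risk
    return 2.0 * shared_volume / total


# Trivial intersection: no shared volume, ratio at the lower bound.
assert two_stock_cluster_ratio(0.08, 0.07, 0.0) == 0.0

# Identical spheres: the whole volume is shared, ratio at the upper bound.
assert two_stock_cluster_ratio(0.08, 0.08, 0.08) == 1.0
```

Under this assumption, the ratio can reach one only when both volumes are equal and fully shared, which mirrors the remark that the upper bound is expected to be extremely rare with real-world data.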
Experimental outputs and results
This section contains back-of-the-envelope calculations of two typical cases using synthetic data. The first is a static case study, which discusses what can be explored in a data set at a single point in time. The second is a dynamic study of clustering patterns through time.
Static case study
Consider a financial investor who needs to build a portfolio limited to two distinct stocks amongst only four possible alternatives available in the equity market. Therefore, there are four stocks and the respective rate of variation from to of the following four variables are used as input data in this case study: return on investment (ROI); earnings per share (EPS); debt-to-equity ratio (D/E); and stock price simple net return (in absolute value). Subsequently, the first three variables are placed into the -axes, respectively and the fourth variable is reflected in the volume of each of the four spheres , as depicted in Fig. 7.
Fig. 7.
Static case study with a sample based on four stocks (i.e. , , , and ) at time .
The red arrows in Fig. 7 indicate which variations of each of the three axial variables – as detailed in the header of Table 1 – would be beneficial or detrimental from an investor’s point of view. Therefore, considering a typical rational investor, in virtually all scenarios the greater the ROI (i.e. the further to the right along the -axis), the better; the greater the EPS (i.e. the further up along the -axis), the better; and, conversely, the lower the D/E (i.e. the greater the depth along the -axis when the azimuthal angle is at 180°), the better. Overall, a rational investor would seek to maximise the return as well as minimise the risk of the investment. Therefore, in terms of return, an investor would prefer to maximise the positive variation of ROI, EPS, and Stock Return while positive (i.e. financial gain); and, conversely, would prefer to maximise the negative variation of D/E and Stock Return while negative (i.e. financial loss). On the other hand, in terms of risk, such a typical investor would prefer the lowest level of variation in all four variables because the lower the volatility, the higher the predictability of the asset, which results in a lower level of investment risk.
Table 1.
Input values of stocks , , , and at time .
| Stock | Δ ROI () | Δ EPS () | Δ D/E () | \|Δ Stock Return\| ( or ) |
|---|---|---|---|---|
| | 0.14 | 0.17 | −0.12 | 0.08 |
| | 0.09 | 0.13 | −0.07 | 0.07 |
| | 0.02 | 0.03 | 0.04 | 0.03 |
| | 0.04 | 0.06 | −0.02 | 0.05 |
The simulated values of each of the variables used as input in this example are detailed in Table 1. Based only on a visual inspection of Fig. 7 and the information provided by Table 1, one can draw the following preliminary and elementary conclusions from a rational investor’s point of view:
• Stock is better off than stock , , or ;
• Stock is better off than stock or ;
• Stock is better off than stock ; and
• .
It is worth mentioning that one of the main goals of the proposed method is to assess, as impartially as possible, competing asset investment alternatives by measuring the similarity level between each pair of assets in a sample. Ultimately, the proposed method does not judge any of its input variables as beneficial or detrimental for the overall performance of an asset portfolio. For instance, it may be the case that a conservative investor A would prefer a lower level of D/E (e.g. lower probability of the underlying company going bankrupt as a consequence of not being able to repay its debts to creditors), while an investor B, with an aggressive profile, would prefer a higher level of D/E (e.g. expecting a higher profit in the near future due to: (i) a higher gearing ratio, meaning more funding for the company’s projects, which is expected to be transformed into higher profits, and (ii) a greater tax shield). Therefore, as the proposed method has a data-driven approach, such subjective and qualitative judgment must be made only by the decision maker, who would potentially benefit from the unbiased and impartial insights provided by the proposed method.
Subsequently, using data from Table 1 as input to the proposed method, one can calculate the values of the two intersections found in the whole sample of four stocks, one of them between stocks and and the other between stocks and . After the calculation of the respective intersection volumes, it is possible to unveil the outputs shown in Tables 2 and 3.
Table 2.
Hypothetical common values (i.e. intersecting spherical volumes) between each pair of stocks in the sample.
Note: The values on the main diagonal refer to intersections between the sphere with itself, which yields the respective individual spherical volume. Such values are not used in the subsequent calculations/analyses.
Table 3.
Calculated similarity factors between each pair of stocks in the sample.
Note: The asterisks (*) refer to not meaningful interactions of the ratio , although still possible to be mathematically calculated. For instance, , in which and .
Therefore, a financial investor seeking stock market diversification opportunities can draw insightful conclusions through a visual inspection of Fig. 7 and, most importantly, from the simulated calculations provided in Tables 2 and 3 such as, but certainly not limited to:
• or does not interact with stock or , and vice-versa;
• is more similar to in comparison with stocks or ;
• is more similar to in comparison with stocks or ;
• is more similar to in comparison with stocks or ;
• is more similar to in comparison with stocks or ;
• There are two data-driven clusters in this sample. Cluster 1 is composed of stocks and and the members of Cluster 2 are stocks and ;
• Cluster 1 ( and ) is slightly more homogeneous than Cluster 2 ( and ) because Cluster 1 has a similarity factor mean of 0.60, compared to 0.53 for Cluster 2. The means of the similarity factors are calculated as follows:
Overall, in terms of the pairwise risk-return trade-off, on one hand, in terms of return, the pair of stocks and outperforms the pair and . On the other hand, for portfolio composition with diversification purposes, where portfolio risk is decreased by choosing stocks that are as distinct as possible within a given pair, a blended portfolio with a stock from Cluster 1 (i.e. or ) and a stock from Cluster 2 (i.e. or ) would result in a higher degree of dissimilarity and would be more suitable if the investor aims to build a portfolio that is as dissimilar as possible (i.e. a higher level of diversification given a certain level of return).
Dynamic case study
The rationale behind (detailed in the section Clustering Pattern Indicator) is applied to a sample of two stocks ( and ) on a monthly basis through a whole year (), with each time corresponding to the last trading day of each of the 12 months of 2008, which includes periods during the Global Financial Crisis of 2007–2008 (GFC of 2007–08 hereinafter). The four variables in this hypothetical dynamic case study are ΔROI, ΔEPS, ΔD/E, and Δ stock price simple net return (in absolute value), all varying from to .
As shown in Table 4, the first value in the time series of (i.e. , reflecting January 2008) is the same value from the previous case study (subsection Static Case Study) for each of the two stocks, which is calculated as follows:
Table 4.
Input values, risk-similarity measures, and sample statistics related to stocks and .
Eleven months later, in (i.e. December 2008), the last date of the sample, the calculation results in:
It is worth noting that two data features determine the value at each time in the series : (i) the level of similarity between the volumes of stocks and (which are placed in the denominator of ), regardless of the value of their intersecting volume, and (ii) the intersecting volume itself between both stocks (which is placed in the numerator of ). The greater (i) and (ii), the closer gets to its maximum value of one (i.e. the extreme case in which both stocks would be identical and, therefore, would have a full non-trivial intersection).
Table 4 contains input values and, most importantly, calculated risk-similarity measures of stocks and on a monthly basis throughout the sample time span as well as basic sample statistics, including the first and second central moments of each variable, with emphasis on the variable (rightmost column).
As depicted in Figs. 8 and 9, according to the measure there was an overall increase in the similarity level between stocks and throughout 2008. More specifically, there was an increase of 43% from the beginning () to the end of the year (), and this growth in similarity was much steeper from mid-year () to the end of the year (), resulting in an increase of 221% from June to December 2008.
Fig. 8.
Line chart depicting the time series and two major negative events during the GFC of 2007-08, from January 2008 to December 2008.
Fig. 9.
Similarity charts of the time series in three distinct points in time: as of January 2008 (top left), June 2008 (top right), and December 2008 (bottom).
In summary, these outputs indicate that the similarity level between both stocks increased markedly from the beginning to the end of 2008, possibly, among other potential factors, as a consequence of negative events and media news related to the GFC of 2007–08, during which both stocks became increasingly similar to each other.
Conceptual comparison with well-established methods
Overall, the method proposed in this paper can be understood as a non-hierarchical hard cluster analysis. One of the main motivations for proposing this novel cluster analysis is to address some relevant problems and challenges frequently reported in the cluster analysis literature. More specifically, the following three problems are not treated properly by most traditional clustering methods: (i) setting the number of clusters in a non-arbitrary manner; (ii) the possibility of a dataset having zero clusters; and (iii) allocating outliers to no cluster.
The proposed method aims to tackle these three clustering problems at once by adopting an unusual strategy: imposing an artificial boundary on each sample subject by representing it as a sphere, instead of a point, in the three-dimensional Euclidean space. This artificial boundary determines the degree of similarity between each pair of subjects in the sample autonomously (i.e. through a data-driven process). Because the method is data-driven, once the input data are included in the model it can provide insightful results to the analyst, such as the following two simple examples: (i) if only one of the sample subjects is very distinct from all of its counterparts, then this subject should have no intersection volume with any other subject in the sample and is represented as an isolated sphere in the three-dimensional space (i.e. an outlier candidate); (ii) if all subjects in the sample are very dissimilar from one another, then one feasible result is the absence of any sphere-sphere intersection, resulting in zero clusters (i.e. a sample in which no cluster is formed “naturally” by the input data).
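The two cases above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the author's implementation: the intersection volume uses the standard closed-form lens formula for two spheres, whereas the function names, the example coordinates and radii, and the rule that groups spheres into clusters (connected components of the overlap graph) are hypothetical choices made only for this example.

```python
import math
from itertools import combinations


def sphere_volume(r):
    return 4.0 / 3.0 * math.pi * r ** 3


def intersection_volume(c1, r1, c2, r2):
    """Volume shared by two spheres with centres c1, c2 and radii r1, r2."""
    d = math.dist(c1, c2)
    if d >= r1 + r2:          # disjoint or externally tangent: no overlap
        return 0.0
    if d <= abs(r1 - r2):     # one sphere lies entirely inside the other
        return sphere_volume(min(r1, r2))
    # Standard lens formula for partially overlapping spheres
    return (math.pi * (r1 + r2 - d) ** 2
            * (d * d + 2 * d * (r1 + r2) - 3 * (r1 - r2) ** 2)
            / (12 * d))


def overlap_groups(spheres):
    """spheres: dict name -> (centre, radius).
    Returns (groups, isolated). ASSUMPTION: a group is a connected
    component of the graph linking spheres with positive overlap."""
    names = list(spheres)
    neighbours = {n: set() for n in names}
    for a, b in combinations(names, 2):
        if intersection_volume(*spheres[a], *spheres[b]) > 0.0:
            neighbours[a].add(b)
            neighbours[b].add(a)
    isolated = [n for n in names if not neighbours[n]]  # outlier candidates
    groups, seen = [], set()
    for n in names:
        if n in seen or not neighbours[n]:
            continue
        stack, group = [n], set()
        while stack:                      # depth-first traversal
            cur = stack.pop()
            if cur not in group:
                group.add(cur)
                stack.extend(neighbours[cur] - group)
        seen |= group
        groups.append(sorted(group))
    return groups, isolated


# Case (i): hypothetical sample in which subject "C" is far from the others.
case_i = {"A": ((0, 0, 0), 1.0), "B": ((1, 0, 0), 1.0), "C": ((10, 10, 10), 1.0)}
print(overlap_groups(case_i))    # ([['A', 'B']], ['C'])

# Case (ii): all subjects mutually dissimilar, so no cluster is formed.
case_ii = {"A": ((0, 0, 0), 1.0), "B": ((5, 0, 0), 1.0), "C": ((0, 5, 0), 1.0)}
print(overlap_groups(case_ii))   # ([], ['A', 'B', 'C'])
```

In this sketch neither the number of clusters nor the treatment of outliers is set by the analyst: both follow from whether the spheres overlap, which mirrors the data-driven behaviour described above.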
The improvements and output possibilities described above point to an advantage over traditional cluster analyses, such as connectivity models (e.g. HCA) or centroid models (e.g. k-means clustering). Finally, a more informative empirical comparison would be against density models, such as DBSCAN (density-based spatial clustering of applications with noise) and/or OPTICS (ordering points to identify the clustering structure), which likewise do not require the number of clusters to be set in advance and can leave noise points unassigned.
Conclusion and future research
The proposed hard cluster method is data-driven and therefore more consistent and less arbitrary than traditional alternatives. Its outputs are intended specifically to support financial investors in their portfolio diversification and asset reallocation strategies. The proposed method analyses similarity patterns between pairs of stocks as well as the market clustering structure through time.
The illustrative cases shown in this paper, which are based on hypothetical data, indicate how the proposed method can be useful for financial investment purposes, more specifically for asset reallocation decisions and portfolio diversification strategy. Moreover, the proposed analysis is not limited to the traditional risk-return trade-off, but rather uses a greater number of dimensions based on investor needs and preferences. This type of financial clustering pattern analysis may be useful mainly when equity stocks tend to become more similar (e.g. during crises or periods of high volatility) and it is therefore more difficult to clearly distinguish investment alternatives with the traditional cluster methods available.
As a suggestion for future research, a natural next stage is the use of real-world variables as input data. The proposed method is not limited to applications in finance and economics: it can potentially be applied to compare similarity levels between sample subjects in a wide range of fields and research problems, as diverse as biology, medicine, geopolitics, engineering, international trade agreements, optimal M&A partnerships, and marketing, among many others. In addition, the development of an algorithm (e.g. a MATLAB program) to perform all steps of the proposed method and the application of the method to real-world economic variables are ongoing projects. The insights to be obtained by applying the proposed method to distinct data sets remain an empirical question.
Acknowledgements
This paper benefited from stimulating conversations with Philip Arestis, Mardi Dungey, Andrew Harvey, Mohammad Hashem Pesaran, Jacob Rasmussen, John Glascock, John Howe, Neil Dodgson, and Alan Thompson, as well as from the careful reading and insightful comments and suggestions of two anonymous referees. All errors remain exclusively the author’s responsibility. The present research was supported by the Coordination for the Improvement of Higher Education Personnel of Brazil (CAPES) and The Cambridge Commonwealth, European & International Trust (under grant BEX 2220/15-6).
Footnotes
1. Both terms (i.e. asset and sphere) will be used interchangeably throughout this paper.
2. Regarding the index notation adopted in this paper, the indexes related to the sample subjects and the index related to time are separated by a vertical bar (i.e. |) in order to keep the notation as neat as possible. Therefore, in the present paper, this symbol does not denote conditional probability, as it commonly does in Bayesian statistics.
References
1. Andersen T.G., Bollerslev T., Diebold F.X., Ebens H. The distribution of realized stock return volatility. J. Financ. Econ. 2001;61(1):43–76.
2. Schwert G.W. Business cycles, financial crises, and stock volatility. Carnegie-Rochester Conference Series on Public Policy. 1989;31:83–125. (September)
3. Reinhart C.M., Rogoff K.S. This Time Is Different: Eight Centuries of Financial Folly. Princeton University Press; 2009.
4. Arestis P., Karakitsos E. Lessons from the ‘Great Recession’ for both theory and economic policy. In: Financial Stability in the Aftermath of the ‘Great Recession’. Palgrave Macmillan; London: 2013. pp. 164–192.
5. Jain A.K. Data clustering: 50 years beyond K-means. Pattern Recognit. Lett. 2010;31(8):651–666.
6. Hennig C., Meila M. Cluster analysis: an overview. In: Handbook of Cluster Analysis. CRC Press, Taylor and Francis Group; 2015. pp. 1–20.
7. Tsay R.S. Analysis of Financial Time Series. vol. 543. John Wiley & Sons; 2005.
8. Kern W.F., Bland J.R. Solid Mensuration: With Proofs. J. Wiley & Sons, Incorporated; 1938.
9. Alvord B. The intersection of circles and the intersection of spheres. Am. J. Math. 1882;5(1):25–44.
10. Court N.A. Four intersecting spheres. Am. Math. Mon. 1960;67(3):241–248.
11. Serezhkin V.N., Mikhailov Y.N., Buslaev Y.A. The method of intersecting spheres for determination of coordination numbers of atoms in crystal structures. Russ. J. Inorg. Chem. 1997;42(12):1871–1910.
12. Weisstein E.W. Sphere-Sphere Intersection. 2007. Available at: http://mathworld.wolfram.com/Sphere-SphereIntersection.html [accessed 2019].
13. Borowski E.J., Borwein J.M. The HarperCollins Dictionary of Mathematics. HarperCollins; 1991.
14. Alt H., Fuchs U., Rote G., Weber G. Matching convex shapes with respect to the symmetric difference. Algorithmica. 1998;21(1):89–103.
15. Strobl S., Formella A., Pöschel T. Exact calculation of the overlap volume of spheres and mesh elements. J. Comput. Phys. 2016;311:158–172.
16. George P.L., Hecht F., Saltel É. Automatic mesh generator with specified boundary. Comput. Methods Appl. Mech. Eng. 1991;92(3):269–288.













