MethodsX. 2019 May 25;6:1261–1278. doi: 10.1016/j.mex.2019.05.025

Sphere-sphere intersection for investment portfolio diversification — A new data-driven cluster analysis

Michel Ferreira Cardia Haddad 1
PMCID: PMC6909007  PMID: 31871911

Graphical abstract

[Image: graphical abstract (fx1.jpg)]

Method name: Data-driven cluster-similarity analysis

Keywords: Cluster analysis, Similarity measure, Asset allocation, Risk analysis, Investment decisions

Abstract

To support investment portfolio diversification with a data-driven approach, this methodological paper proposes a new cluster analysis that compares publicly traded companies, mainly in times of high volatility (e.g. crises). The main goal of the proposed method is to provide a less arbitrary analysis that helps financial investors measure precisely the degree of similarity between equity stocks, unveiling equity market clustering patterns by applying analytic geometry solutions and calculating an overall clustering pattern indicator. Empirical results on synthetic data demonstrate both the conceptual superiority of the proposed method over traditional cluster analyses and its potential practical usefulness for asset allocation, portfolio strategy, and asset pricing, among other related purposes. Finally, the outputs of the proposed cluster analysis are presented through an intuitive and easily understandable mathematical visualization.

  • A new method is proposed to calculate risk-similarity and clustering patterns.

  • The method unveils clustering patterns through a data-driven process.

  • Portfolio diversification can benefit from sphere-sphere intersection calculations.


Specifications Table

Subject Area: Social Sciences
More specific subject area: Econophysics
Method name: Data-driven Cluster-similarity Analysis
Name and reference of original method: Main original method names: k-means Clustering and Hierarchical Cluster Analysis (HCA)
Main original method reference: Jain, A.K., 2010. Data clustering: 50 years beyond K-means. Pattern Recognition Letters, 31(8), pp. 651–666
Resource availability: N/A

Method details

Introduction

The greater frequency and impact of financial and economic crises require the development of feasible methods that clearly distinguish investment alternatives in a robust, consistent, and coherent manner [[1], [2], [3], [4]]. The main goal of the proposed data-driven cluster-similarity risk method is to help financial investors measure precisely the level of similarity between publicly listed companies through time, especially in turbulent periods, with a clear graphical representation of the outputs as a by-product of the analysis. As depicted in the Graphical Abstract, the method is performed in three successive steps, in which each asset¹ is represented by a corresponding 2-sphere (i.e. an ordinary three-dimensional sphere) in the three-dimensional Euclidean space (R3).

Step one analyses the trajectory of each equity stock through time and calculates the variation of the individual risk factor, termed² IRFi|t, by finding the radius of the corresponding 2-sphere (simply "sphere" hereafter). Subsequently, in step two the spatial approximation between assets/spheres is visualized and the intersection volume, termed CRFi,j|t, between every pair of spheres in the sample is calculated using analytic geometry. Finally, in step three both the individual spherical volumes and the intersection volumes calculated in the previous steps are used as inputs to an overall clustering pattern indicator bounded from zero to one, a proxy measure designed to assess the level of shared risk between all q stocks in a sample based on their level of similarity at any particular date as well as through time t.

Most of the labour involved in the proposed method occurs in the first two steps, which involve geometrical calculations on spheres and intersection volumes in the three-dimensional Euclidean space. However, the labour required is well compensated by the precise outputs generated, which result in consistent estimates as well as a clear mathematical visualization of the calculations and analyses performed. The motivation for proposing this non-hierarchical clustering method is to tackle some relevant problems often reported in the cluster analysis literature that traditional clustering methods, such as k-means clustering and hierarchical cluster analysis (HCA), do not address properly [5,6]: (i) setting the number of clusters in a non-arbitrary manner (i.e. through a data-driven process); (ii) the possibility of a dataset having zero clusters; and (iii) allocating outliers to no cluster.

Although the proposed method was primarily designed to work with quantitative continuous variables as input data, it is possible to use virtually any type of data (e.g. nominal, ordinal, binary) to be placed into any of the three axes in R3. An important characteristic of using such non-continuous variables is that the spatial distance between each value in the axis must be equally spaced. Therefore, a relevant caveat of using such variables is that this equal spatial distance between each value needs to be set arbitrarily by the analyst. This arbitrary setting can potentially distort the analysis performed, by allowing the analyst to manipulate the outputs – for instance, by setting a very large spatial distance between each value in the axis in order to decrease the number of spherical volume intersections and, consequently, artificially reducing the number of clusters in a particular sample, which is extremely undesirable. This problem is avoided once input data are based on continuous variables since the spatial distance between each value in the axis would be determined solely by the data itself (i.e. data-driven), following a predetermined standard scale (e.g. company revenue, inflation rate, exchange rate).

Following this introduction, the paper is divided into five sections and it proceeds as follows. Sections Asset Individual Trajectory (Step 1), Spatial Approximation between Assets (Step 2) and Clustering Pattern Indicator (Step 3) explore and detail the first, second, and third steps of the proposed method, respectively. Section Experimental Outputs and Results provides two case studies and a brief discussion over the results. Finally, the last section concludes and suggests extensions for future research.

Step 1: asset individual trajectory

The proposed analysis starts by verifying the spatial trajectory of each stock (based on their respective axial variables) as well as the contemporaneous progress of their respective individual risk factor (based on their respective spherical volume). There are four values to be used as data input in the proposed analysis: the first, second, and third values are reflected on the x-axis, y-axis, and z-axis, respectively; and the fourth value is reflected in the volume of each sphere (i.e. Vi|t or IRFi|t).

Each of those four values (i.e. xi|t, yi|t, zi|t, and Vi|t) is based on the rate of variation (from t − 1 to t) of four real-world variables (i.e. ςi|t, κi|t, ωi|t, and νi|t), respectively. The rate of variation of each of the four real-world variables to be placed into each of the three axes as well as reflected in the sphere volume are calculated as follows [7]:

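$$
\begin{aligned}
x_{i|t} &= \frac{\varsigma_{i|t} - \varsigma_{i|t-1}}{\varsigma_{i|t}}, & x_{i|t},\ \varsigma_{i|t} &\in \mathbb{R} \\
y_{i|t} &= \frac{\kappa_{i|t} - \kappa_{i|t-1}}{\kappa_{i|t}}, & y_{i|t},\ \kappa_{i|t} &\in \mathbb{R} \\
z_{i|t} &= \frac{\omega_{i|t} - \omega_{i|t-1}}{\omega_{i|t}}, & z_{i|t},\ \omega_{i|t} &\in \mathbb{R} \\
V_{i|t} &= \left| \frac{\nu_{i|t} - \nu_{i|t-1}}{\nu_{i|t}} \right|, & V_{i|t} \in \mathbb{R}^{+},\ \nu_{i|t} &\in \mathbb{R}
\end{aligned}
\tag{1}
$$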

where q = 1, 2, i, j, …, Q−1, Q and t = 1, 2, …, T−1, T refer to the subjects and the time span of the sample, respectively. Throughout this paper and in distinct contexts, the subjects i and j are frequently used to exemplify two generic sample subjects. It is worth noting that the fourth variable is taken in absolute value, because it is reflected in the volume of the respective sphere in R3. Therefore, the time series placed on the three axes as well as the one used as the spherical volume are sequenced as follows:

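$$
\begin{aligned}
x_{i|t} &= x_{1|1}, \ldots, x_{1|T};\; x_{2|1}, \ldots, x_{2|T};\; \ldots;\; x_{Q|1}, \ldots, x_{Q|T} \\
y_{i|t} &= y_{1|1}, \ldots, y_{1|T};\; y_{2|1}, \ldots, y_{2|T};\; \ldots;\; y_{Q|1}, \ldots, y_{Q|T} \\
z_{i|t} &= z_{1|1}, \ldots, z_{1|T};\; z_{2|1}, \ldots, z_{2|T};\; \ldots;\; z_{Q|1}, \ldots, z_{Q|T}, \ \text{and} \\
\mathrm{IRF}_{i|t} \equiv V_{i|t} &= V_{1|1}, \ldots, V_{1|T};\; V_{2|1}, \ldots, V_{2|T};\; \ldots;\; V_{Q|1}, \ldots, V_{Q|T}
\end{aligned}
$$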
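As a minimal sketch of how the four input series could be prepared, the Python snippet below computes the rates of variation and takes the absolute value of the fourth one. It divides by the current-period value, which is how Eq. (1) is read here; the function names `rate_of_variation` and `sphere_inputs` are hypothetical and not part of the original method description.

```python
def rate_of_variation(series):
    """Rate of variation from t-1 to t for each consecutive pair.

    Assumption: the denominator is the value at time t, as Eq. (1) is read here.
    """
    return [(cur - prev) / cur for prev, cur in zip(series, series[1:])]


def sphere_inputs(sigma, kappa, omega, nu):
    """Return one (x, y, z, V) tuple per period from four raw series.

    The first three rates become axis coordinates; the absolute value of the
    fourth rate becomes the spherical volume (the individual risk factor).
    """
    x = rate_of_variation(sigma)
    y = rate_of_variation(kappa)
    z = rate_of_variation(omega)
    v = [abs(rate) for rate in rate_of_variation(nu)]
    return list(zip(x, y, z, v))


# Illustrative two-period series for one stock (made-up numbers).
inputs = sphere_inputs([4.0, 5.0], [2.0, 4.0], [10.0, 8.0], [10.0, 8.0])
print(inputs)
```

Each tuple holds the centre coordinates and the volume of one sphere at one date, so a sample of Q stocks over T dates yields Q lists of T − 1 tuples.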

Each sphere represents a different stock within a particular industry in the three-dimensional Euclidean space. For instance, in the automotive industry it would be reasonable to compare stock indicators (i.e. data input) from companies such as Fiat, Ford, General Motors, Toyota, and Volkswagen; in the technology industry, data from companies such as Alphabet/Google, Amazon, Apple, Facebook, and Microsoft would be analysed; among many other possibilities.

The volume of each sphere is analytically calculated and graphically depicted in R3. The spherical volume aims to reflect a variable that is recognised as the best proxy available to measure the level of risk of a particular data set. This variable should be relevant and meaningful according to the point of view of the investor/ analyst who is performing the analysis. Moreover, this risk proxy should account for industry context and idiosyncrasies (e.g. levels of expected stock price return, equity-to-debt ratio, market capitalization, or any other well justified indicator). As illustrated in Fig. 1, this risk proxy varies dynamically according to each different time t as well as distinct stock in the sample. Earnings per share (EPS), price to earnings (P/E) ratio, price to book value (P/B) ratio, and dividend yield are just a few examples, among a wide range of possibilities, of real-world economic variables and/or financial indicators to be placed at any of three axes in R3 (Fig. 1).

Fig. 1. Visualisation of spatial trajectories of stocks Si|t and Sj|t in R3 from t=1 (left) to t=2 (right).

As with any quantitative analysis, it is of utmost importance that the analyst coherently selects and properly justifies each variable used as input data for the proposed method, especially the fourth and most important variable, which is reflected in the volume of the respective sphere (i.e. IRFi|t). How to choose the most appropriate variables is outside the scope of this paper; the choice can be based on a series of factors, such as individual investment preferences, risk tolerance, and market consensus, among many others.

The analysis is performed in the three-dimensional space R3 because, compared to the two-dimensional space R2, one additional variable is included, which enriches the analysis by allowing more information to be added as input data. The present method could feasibly be formulated in an n-dimensional Euclidean space Rn, which would have the desirable effect of letting the analyst use n variables instead of three. However, in Rn there would be no consistent graphical visualisation of the outputs.

Asset individual risk factor: IRFi|t

As depicted in Fig. 2, each equity stock is represented by its respective sphere in R3 and their individual volume is given by a real-world economic or financial variable, termed as IRFi|t, which is calculated as follows:

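$$
\mathrm{IRF}_{i|t} \equiv V_{i|t} = \frac{r_{i|t}^{3}\,\pi^{3/2}}{\Gamma\left(\frac{3}{2} + 1\right)} = \frac{4}{3}\pi\, r_{i|t}^{3}, \quad \mathrm{IRF}_{i|t} \in \mathbb{R}^{+},\ i \in \mathbb{N},\ t \in \mathbb{Z}^{+}
\tag{2}
$$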

where ri|t is the time-varying radius of the sphere, π is the well-known ratio of a circle's circumference to its diameter (here, a great circle of the sphere), and Γ is the gamma function; all variables refer to stock Si|t at time t. Eq. (2) gives the classical formula for the volume of a sphere in three-dimensional Euclidean space (R3), adapted here to incorporate the temporal variation reflected in the volume and in the time-varying radius of the respective sphere. In addition, to check the accuracy of Eq. (2) against particular real-world data used as input in the proposed analysis, one can examine the value of the transcendental number π, which must represent the same spherical proportion for any subject and at every date t, as follows:

$$
\frac{c_{i|t}}{D_{i|t}} \equiv \frac{c_{i|t}}{2\, r_{i|t}} \equiv \pi \approx 3.14159265358979
\tag{3}
$$

Where ci|t is the circumference and Di|t is the diameter of the sphere, both representing stock Si|t at time t.

Fig. 2. Equity stock Si|t represented as a sphere in R3 at time t, with centre Ci|t at (xi|t, yi|t, zi|t), radius ri|t, and volume Vi|t.

Time-varying radius: ri|t

As π and Γ are constants (and, therefore, not a function of time) and as IRFi|t is known in advance (because it refers to a given real-world data value used as the main risk proxy in the proposed analysis), in fact the only unknown variable in Eq. (2) is the time-varying radius ri|t, which can be found as follows:

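$$
\frac{V_{i|t}}{r_{i|t}^{3}} = \frac{4}{3}\pi
\tag{4}
$$

Replacing the sphere volume Vi|t by IRFi|t in Eq. (4) and solving for ri|t gives:

$$
r_{i|t} = \frac{1}{\sqrt[3]{\left(\mathrm{IRF}_{i|t}\right)^{-1}\,\frac{4}{3}\pi}} = \sqrt[3]{\frac{3\,\mathrm{IRF}_{i|t}}{4\pi}}
\tag{5}
$$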
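The relationship between the risk proxy and the radius can be checked numerically. The sketch below, with hypothetical function names, evaluates both forms of the volume formula in Eq. (2) and inverts the volume to recover the radius as in Eq. (5).

```python
import math


def sphere_volume(radius):
    """Classical volume of a sphere: (4/3) * pi * r^3."""
    return (4.0 / 3.0) * math.pi * radius ** 3


def sphere_volume_gamma(radius):
    """Same volume written with the gamma function, as in Eq. (2)."""
    return radius ** 3 * math.pi ** 1.5 / math.gamma(1.5 + 1.0)


def radius_from_volume(volume):
    """Radius of the sphere whose volume equals the given risk proxy."""
    return (3.0 * volume / (4.0 * math.pi)) ** (1.0 / 3.0)


# Illustrative risk proxy value; any non-negative volume works.
irf = 0.08
radius = radius_from_volume(irf)
print(radius, sphere_volume(radius), sphere_volume_gamma(radius))
```

Because the volume grows with the cube of the radius, large changes in the risk proxy translate into comparatively small changes in the radius.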

Thus, as time t passes and new events occur (e.g. news in the media, government communications, entry of new competitors, CEO retirement), stocks (represented by spheres) follow a spatial trajectory, experience spherical volume variation through time, and eventually interact with one another, reflecting real-world interactions in R3. For instance, if one of a company's indicators (e.g. ROI, reflected in the x-axis) is considered the benchmark of its industry (e.g. Apple within the technology industry, Exxon Mobil in the oil and gas sector), then its competitors tend to adopt similar management decisions in an attempt to close the gap between their own indicator (in this example, ROI) and the industry benchmark. As a consequence of that competitive move, the indicators of the stocks in the sample tend to become more similar through time and, therefore, closer to one another in R3.

Step 2: spatial approximation between assets

After calculating the individual risk factors of every stock in the sample in step 1, it is possible to calculate, and to confirm by visual inspection in R3, whether there is no risk volume sharing between stocks (i.e. there are only trivial intersections between pairs of spheres in R3) or whether risk volume sharing between one or more pairs of spheres has effectively occurred (i.e. there is at least one non-trivial intersection between spheres in R3), as depicted in Figs. 3 and 4.

Fig. 3. Non-risk volume sharing situation between stocks Si|t and Sj|t in R3 at time t, represented by a trivial intersection in which Si|t ∩ Sj|t = ∅.

Fig. 4. Risk volume sharing situation between stocks Si|t and Sj|t in R3 at time t, represented by a non-trivial intersection in which Si|t ∩ Sj|t ≠ ∅.

The individual risk factor IRFi|t (i.e. the risk proxy of stock Si|t, or its spherical volume regardless of any intersection at a particular date t) is found in step 1 by calculating the volume of the sphere that represents each stock at each point in time t. If there is an intersection between stocks Si|t and Sj|t in R3, then it is possible to calculate the volume shared between those two stocks. This shared volume (i.e. non-trivial intersection) occurs due to (i) greater similarity in the spatial trajectory and proximity of the indicators between stocks; (ii) an increase in the risk perception proxy measure of each stock, reflected in a greater sphere volume; or (iii) a combination of items (i) and (ii).

Asset common risk factor: CRFi,j|t

The shared volume between stocks Si|t and Sj|t is termed as the common risk factor CRFi,j|t due to the fact that this value is attributed to individual risk factors IRFi|t and IRFj|t that are shared between the pair of stocks Si|t and Sj|t, as follows:

$$
\mathrm{CRF}_{i,j|t} \equiv S_{i|t} \cap S_{j|t}
\tag{6}
$$

Analytic geometry is the natural way to calculate the precise volume of two spheres intersecting in R3, as graphically depicted in Fig. 5. There is a non-trivial volume intersection between two spheres Si|t and Sj|t if and only if:

$$
\left\| C_{i|t} - C_{j|t} \right\| < r_{i|t} + r_{j|t}, \quad S_{i|t} \cap S_{j|t} \neq \emptyset
\tag{7}
$$

where Ci|t and Cj|t are the centres of spheres Si|t and Sj|t, respectively, and ‖·‖ refers to the Euclidean norm.
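The condition in Eq. (7) is straightforward to test in code. The following sketch uses a hypothetical helper named `spheres_overlap` and compares the Euclidean distance between the centres with the sum of the radii.

```python
import math


def spheres_overlap(centre_i, radius_i, centre_j, radius_j):
    """True when the centre distance is strictly below the sum of the radii.

    Tangent spheres (distance equal to the sum of the radii) share a single
    point and no volume, so they are reported as not overlapping.
    """
    return math.dist(centre_i, centre_j) < radius_i + radius_j


# Illustrative unit spheres at increasing centre distances.
print(spheres_overlap((0, 0, 0), 1.0, (1.5, 0, 0), 1.0))  # overlapping
print(spheres_overlap((0, 0, 0), 1.0, (2.0, 0, 0), 1.0))  # tangent
print(spheres_overlap((0, 0, 0), 1.0, (3.0, 0, 0), 1.0))  # apart
```

Running this check on every pair of spheres before any volume calculation separates the pairs that can share risk volume from those that cannot.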

Definition 1

The intersection of two spheres is the circumference of a circle whose plane is perpendicular to the line joining the centres of the surfaces and whose centre is in that line [8].

Fig. 5. Non-trivial intersection between spheres Si|t (left) and Sj|t (right) at time t, resulting in the CRFi,j|t.

The spheres Si|t and Sj|t, of radii ri|t and rj|t, are placed at time t and centred at Ci|t = (0, 0, 0) and Cj|t = (dj|t, 0, 0) in R3, respectively. The calculation of the sphere-sphere volume intersection is similar to that of the circle-circle area intersection, in accordance with Definition 1, which plays an important role in the calculation. The steps of the analytical calculation of a non-trivial intersection between spheres Si|t and Sj|t are detailed below [[9], [10], [11], [12]].

The equation of sphere Si|t is defined in Eq. (8) and the equation of sphere Sj|t is defined in Eq. (9), as follows:

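$$
r_{i|t}^{2} = x_{i|t}^{2} + y_{i|t}^{2} + z_{i|t}^{2}
\tag{8}
$$

$$
r_{j|t}^{2} = \left(x_{i|t} - d_{j|t}\right)^{2} + y_{i|t}^{2} + z_{i|t}^{2}
\tag{9}
$$

Combining Eq. (8) and Eq. (9) gives:

$$
r_{j|t}^{2} = \left(x_{i|t} - d_{j|t}\right)^{2} + \left(r_{i|t}^{2} - x_{i|t}^{2}\right)
\tag{10}
$$

Multiplying through and rearranging Eq. (10) results in the following:

$$
x_{i|t}^{2} - 2 d_{j|t} x_{i|t} + d_{j|t}^{2} - x_{i|t}^{2} = r_{j|t}^{2} - r_{i|t}^{2}
\tag{11}
$$

Solving Eq. (11) for xi|t gives:

$$
x_{i|t} = \frac{d_{j|t}^{2} - r_{j|t}^{2} + r_{i|t}^{2}}{2 d_{j|t}}
\tag{12}
$$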

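The intersection between both spheres is a curve lying in a plane parallel to the (yi|t, zi|t)-plane at a single xi|t-coordinate. Plugging this coordinate back into Eq. (8) yields the following:

$$
y_{i|t}^{2} + z_{i|t}^{2} = r_{i|t}^{2} - x_{i|t}^{2}
\tag{13}
$$

$$
y_{i|t}^{2} + z_{i|t}^{2} = r_{i|t}^{2} - \left(\frac{d_{j|t}^{2} - r_{j|t}^{2} + r_{i|t}^{2}}{2 d_{j|t}}\right)^{2}
\tag{14}
$$

$$
y_{i|t}^{2} + z_{i|t}^{2} = \frac{4 d_{j|t}^{2} r_{i|t}^{2} - \left(d_{j|t}^{2} - r_{j|t}^{2} + r_{i|t}^{2}\right)^{2}}{4 d_{j|t}^{2}}
\tag{15}
$$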

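This describes a circle with the following radius rλ|t:

$$
r_{\lambda|t} = \frac{1}{2 d_{j|t}} \sqrt{4 d_{j|t}^{2} r_{i|t}^{2} - \left(d_{j|t}^{2} - r_{j|t}^{2} + r_{i|t}^{2}\right)^{2}}
\tag{16}
$$

$$
r_{\lambda|t} = \frac{1}{2 d_{j|t}} \sqrt{\left(-d_{j|t} + r_{j|t} - r_{i|t}\right)\left(-d_{j|t} - r_{j|t} + r_{i|t}\right)\left(-d_{j|t} + r_{j|t} + r_{i|t}\right)\left(d_{j|t} + r_{j|t} + r_{i|t}\right)}
\tag{17}
$$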
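The position of the intersection plane and the radius of the intersection circle can be evaluated directly from the centre distance and the two radii. The sketch below uses hypothetical function names and assumes the two spheres partially overlap, so that the quantity under the square root is non-negative; it computes the radius with both the expanded and the factored expression.

```python
import math


def intersection_plane_x(d, r_i, r_j):
    """x-coordinate of the plane containing the intersection circle, Eq. (12)."""
    return (d ** 2 - r_j ** 2 + r_i ** 2) / (2.0 * d)


def circle_radius_expanded(d, r_i, r_j):
    """Radius of the intersection circle, expanded form of Eq. (16)."""
    radicand = 4.0 * d ** 2 * r_i ** 2 - (d ** 2 - r_j ** 2 + r_i ** 2) ** 2
    return math.sqrt(radicand) / (2.0 * d)


def circle_radius_factored(d, r_i, r_j):
    """Radius of the intersection circle, factored form of Eq. (17)."""
    product = ((-d + r_j - r_i) * (-d - r_j + r_i)
               * (-d + r_j + r_i) * (d + r_j + r_i))
    return math.sqrt(product) / (2.0 * d)


# Illustrative case: two unit spheres whose centres are one unit apart.
d, r_i, r_j = 1.0, 1.0, 1.0
x = intersection_plane_x(d, r_i, r_j)
radius = circle_radius_expanded(d, r_i, r_j)
print(x, radius, circle_radius_factored(d, r_i, r_j))
```

A useful sanity check is that the plane coordinate and the circle radius satisfy the equation of the first sphere, that is, x² plus the squared circle radius equals ri|t².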

Thus, the volume of the three-dimensional common lens of spheres Si|t and Sj|t can be found by adding the respective spherical caps. The distances from the centres of spheres Si|t and Sj|t to the bases of the respective caps are given by:

$$
d_{\alpha|t} = x_{i|t}
\tag{18}
$$

$$
d_{\beta|t} = d_{j|t} - x_{i|t}
\tag{19}
$$

The heights of each of the spherical caps are then calculated as follows:

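$$
h_{i|t} = r_{i|t} - d_{\alpha|t} = \frac{\left(r_{j|t} - r_{i|t} + d_{j|t}\right)\left(r_{j|t} + r_{i|t} - d_{j|t}\right)}{2 d_{j|t}}
\tag{20}
$$

$$
h_{j|t} = r_{j|t} - d_{\beta|t} = \frac{\left(r_{i|t} - r_{j|t} + d_{j|t}\right)\left(r_{i|t} + r_{j|t} - d_{j|t}\right)}{2 d_{j|t}}
\tag{21}
$$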

The volume of a spherical cap of height h′t for a sphere of radius r′t is:

$$
V_{\mathrm{cap}|t}\left(r'_{t}, h'_{t}\right) = \frac{1}{3}\pi \left(h'_{t}\right)^{2} \left(3 r'_{t} - h'_{t}\right)
\tag{22}
$$

Therefore, in order to specifically find the CRFi,j|t, it is necessary to sum both spherical caps, as follows:

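$$
\mathrm{CRF}_{i,j|t} = V_{\mathrm{cap}\,i|t}\left(r_{i|t}, h_{i|t}\right) + V_{\mathrm{cap}\,j|t}\left(r_{j|t}, h_{j|t}\right)
\tag{23}
$$

$$
\mathrm{CRF}_{i,j|t} = \frac{\pi \left(r_{i|t} + r_{j|t} - d_{j|t}\right)^{2} \left(d_{j|t}^{2} + 2 d_{j|t} r_{j|t} - 3 r_{j|t}^{2} + 2 d_{j|t} r_{i|t} + 6 r_{j|t} r_{i|t} - 3 r_{i|t}^{2}\right)}{12\, d_{j|t}}
\tag{24}
$$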

Where Vcapi|t and Vcapj|t refer to the volume of the spherical caps of spheres Si|t and Sj|t, respectively. In the case that dj|t=rj|t+ri|t, the expression above gives CRFi,j|t=0 as one would expect.
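The two routes to the common risk factor, summing the spherical caps and evaluating the closed-form expression, can be compared numerically. The sketch below uses hypothetical function names and assumes partially overlapping spheres, meaning the centre distance lies between the absolute difference and the sum of the radii; outside that range the lens formulas do not apply.

```python
import math


def cap_volume(radius, height):
    """Volume of a spherical cap of the given height, Eq. (22)."""
    return math.pi * height ** 2 * (3.0 * radius - height) / 3.0


def lens_volume_from_caps(d, r_i, r_j):
    """Shared volume as the sum of the two spherical caps, Eq. (23)."""
    x = (d ** 2 - r_j ** 2 + r_i ** 2) / (2.0 * d)  # intersection plane
    h_i = r_i - x          # cap height on the sphere centred at the origin
    h_j = r_j - (d - x)    # cap height on the sphere centred at (d, 0, 0)
    return cap_volume(r_i, h_i) + cap_volume(r_j, h_j)


def lens_volume_closed_form(d, r_i, r_j):
    """Shared volume from the closed-form expression, Eq. (24)."""
    gap = r_i + r_j - d
    poly = (d ** 2 + 2.0 * d * r_j - 3.0 * r_j ** 2
            + 2.0 * d * r_i + 6.0 * r_j * r_i - 3.0 * r_i ** 2)
    return math.pi * gap ** 2 * poly / (12.0 * d)


# Illustrative case: radii 2 and 1 with centres 2.5 apart.
print(lens_volume_from_caps(2.5, 2.0, 1.0))
print(lens_volume_closed_form(2.5, 2.0, 1.0))
print(lens_volume_closed_form(2.0, 1.0, 1.0))  # tangent spheres
```

Both functions should agree up to floating-point rounding, and the tangent case returns zero because the squared gap term vanishes.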

This section considers only the intersection between the two spheres Si|t and Sj|t in R3 at a particular point in time t. However, depending on the number of subjects q included in the sample, it may be necessary to consider more than two spheres, performing intersection calculations between q spheres (representing the q stocks in the respective sample) at time t. For samples with q > 2, computational geometry techniques would be a suitable alternative for reaching reasonably approximate spherical volume intersection results through numerical analysis.³
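One simple numerical option, offered here only as an illustration and not as the technique prescribed by the paper, is a Monte Carlo estimate. The sketch below samples points uniformly in the bounding box of the smallest sphere and counts those that fall inside every sphere in the list; the function name, the sample size, and the fixed seed are all assumptions.

```python
import math
import random


def common_volume_monte_carlo(spheres, samples=200_000, seed=0):
    """Estimate the volume of the region lying inside all given spheres.

    spheres: list of (centre, radius) pairs with centre = (x, y, z).
    The common region is contained in the smallest sphere, so points are
    drawn from that sphere's bounding cube.
    """
    rng = random.Random(seed)  # fixed seed keeps the estimate reproducible
    (cx, cy, cz), r = min(spheres, key=lambda sphere: sphere[1])
    box_volume = (2.0 * r) ** 3
    hits = 0
    for _ in range(samples):
        point = (cx + rng.uniform(-r, r),
                 cy + rng.uniform(-r, r),
                 cz + rng.uniform(-r, r))
        if all(math.dist(point, centre) <= radius for centre, radius in spheres):
            hits += 1
    return box_volume * hits / samples


# Two unit spheres one unit apart: the exact lens volume is 5*pi/12.
pair = [((0.0, 0.0, 0.0), 1.0), ((1.0, 0.0, 0.0), 1.0)]
print(common_volume_monte_carlo(pair), 5.0 * math.pi / 12.0)
```

Passing two spheres gives an approximation of the pairwise lens volume, which allows the estimator to be validated against the analytical result before it is applied to three or more spheres.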

Asset idiosyncratic risk through symmetric difference: IRi|t

Through symmetric difference one can easily find the idiosyncratic risk measure of stock Si|t, termed as IRi|t, which refers to a risk value attributed only to individual factors of each stock at a particular time t. The calculation of this type of risk is performed to provide the remaining volume of the sphere that has no intersection with any other stock in the sample (i.e. total volume of the sphere representing the risk information carried by stock Si|t subtracted by the intersecting volume with other stock’s spheres), as follows:

$$
\mathrm{IR}_{i|t} \equiv \mathrm{IRF}_{i|t} \setminus \left(\mathrm{IRF}_{i|t} \cap \mathrm{IRF}_{j|t}\right)
\tag{25}
$$

The symmetric difference of IRFi|t and IRFj|t is associative as well as commutative, and can be alternatively denoted using the following notation [13,14]:

$$
\mathrm{IRF}_{i|t} \,\triangle\, \mathrm{IRF}_{j|t} = \mathrm{IRF}_{j|t} \,\triangle\, \mathrm{IRF}_{i|t}
\tag{26}
$$
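In volume terms, and for a single pair of stocks, the idiosyncratic risk is what remains of a sphere's volume after removing the shared part. The sketch below is a hypothetical helper for that two-stock case only; with more than two stocks the overlaps may themselves overlap, which this simple subtraction does not handle.

```python
def idiosyncratic_risk(irf_i, crf_ij):
    """Volume of sphere i that is not shared with sphere j."""
    if not 0.0 <= crf_ij <= irf_i:
        raise ValueError("shared volume must lie between 0 and the sphere volume")
    return irf_i - crf_ij


# Illustrative volumes: an individual risk factor of 0.08 with 0.045 shared.
print(idiosyncratic_risk(0.08, 0.045))
# With no shared volume, all of the risk is idiosyncratic.
print(idiosyncratic_risk(0.08, 0.0))
```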

The concept of IRi|t can also be interpreted as follows: although two publicly traded companies are experiencing problems at the same time t, by analysing their respective stock indicators used as input data in the proposed method, those two companies are potentially experiencing distinct problems and that is the reason that stocks Si|t and Sj|t have very low or zero volume risk sharing at a particular point in time. Fig. 6 depicts the concept of the idiosyncratic risk IRi|t.

Fig. 6. Idiosyncratic risks IRi|t and IRj|t of stocks Si|t (left) and Sj|t (right), respectively.

Finally, in the case that IRFi|t ∩ IRFj|t = 0, the overall risk of stock Si|t can be interpreted as the result of individual (i.e. idiosyncratic) problems that are distinct, or even very distinct, from the problems being experienced by stock Sj|t and, therefore, both stocks have zero volume risk sharing. It is worth noting that care must be taken when reading results based on intersecting spherical volumes and translating them into meaningful similarity and risk measures that may support the decision-making process.

Step 3: clustering pattern indicator

In the last step, the values found in the two previous steps are used as input to an overall clustering pattern indicator, termed Rt. This indicator measures how similar (or dissimilar) the sample subjects (e.g. stocks) are at each point in time.

Time series cluster ratio of CRFi,j|t over IRFi|t and IRFj|t: Rt

After finding the individual risk factors IRFi|t and IRFj|t as well as the common risk factor CRFi,j|t between every pair of distinct spheres in the sample (excluding the intersection of a sphere with itself, such as V1|t ∩ V1|t, which simply yields V1|t), the intersecting volumes are placed in the numerator and the sum of the individual volumes of all spheres is placed in the denominator of the following time series cluster ratio indicator:

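$$
R_{t} \equiv \frac{\left(\mathrm{IRF}_{1|t} \cap \mathrm{IRF}_{2|t}\right) + \left(\mathrm{IRF}_{2|t} \cap \mathrm{IRF}_{1|t}\right) + \cdots + \left(\mathrm{IRF}_{Q|t} \cap \mathrm{IRF}_{Q-1|t}\right) + \left(\mathrm{IRF}_{Q-1|t} \cap \mathrm{IRF}_{Q|t}\right)}{\mathrm{IRF}_{1|t} + \mathrm{IRF}_{2|t} + \cdots + \mathrm{IRF}_{Q-1|t} + \mathrm{IRF}_{Q|t}}
\tag{27}
$$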

Which can alternatively be written as:

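$$
R_{t} \equiv \frac{\mathrm{CRF}_{1,2|t} + \mathrm{CRF}_{2,1|t} + \cdots + \mathrm{CRF}_{Q-1,Q|t} + \mathrm{CRF}_{Q,Q-1|t}}{\mathrm{IRF}_{1|t} + \mathrm{IRF}_{2|t} + \cdots + \mathrm{IRF}_{Q-1|t} + \mathrm{IRF}_{Q|t}}
\tag{28}
$$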

Where

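$$
\begin{aligned}
& 0 \le R_{t} \le 1, \quad R_{t} \in \mathbb{Q}^{+} \\
& 0 \le \mathrm{IRF}_{1|t}, \ldots, \mathrm{IRF}_{Q|t} < +\infty, \quad \mathrm{IRF}_{1|t}, \ldots, \mathrm{IRF}_{Q|t} \in \mathbb{R}^{+} \\
& 0 \le \left(\mathrm{CRF}_{1,2|t}\right) + \cdots + \left(\mathrm{CRF}_{Q,Q-1|t}\right) \le \left(\mathrm{IRF}_{1|t} + \cdots + \mathrm{IRF}_{Q|t}\right), \\
& \left(\mathrm{IRF}_{1|t} + \cdots + \mathrm{IRF}_{Q|t}\right) \in \mathbb{R}^{+} \ \text{and} \ \left(\mathrm{CRF}_{1,2|t}\right) + \cdots + \left(\mathrm{CRF}_{Q,Q-1|t}\right) \in \mathbb{R}^{+}
\end{aligned}
$$

In the case of the series IRF1|t, …, IRFQ|t, the lower limit (i.e. zero) refers not only to the usual concept of an almost infinitesimal value, but also to the actual numerical possibility of the value being zero itself. On the other hand, the upper limit (i.e. +∞) does not refer to an actual value; it is a mathematical concept that conveys the idea of an extremely large but unreachable number, which in this case would reflect a very large spherical volume.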

In the case in which there are only trivial intersections (i.e. Si|t ∩ Sj|t = ∅) the numerator is zero, which results in Rt = 0. Conversely, in the opposite extreme case in which all spheres in the sample have exactly the same three coordinates and the same volume, Rt = 1. Thus, the closer Rt is to one, the more similar and grouped the stocks in the sample are; the closer Rt is to zero, the more dissimilar and separated the stocks are from each other. In analyses using real-world variables as data input, most results are expected to lie strictly between the two extreme values of the closed interval [0, 1], and reaching the precise upper bound of one, although not impossible, would be extremely rare.
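A minimal sketch of the indicator in Eq. (28) is given below. The function name `cluster_ratio` and the input layout (a dictionary of individual volumes and a dictionary of shared volumes keyed by unordered pairs) are assumptions made for illustration; each unordered pair contributes twice to the numerator because the formula counts both CRFi,j|t and CRFj,i|t.

```python
def cluster_ratio(irf, crf):
    """Clustering pattern indicator R_t for one date.

    irf: dict mapping a stock label to its individual volume.
    crf: dict mapping an unordered pair (a, b) to the shared volume;
         pairs with no overlap can simply be left out.
    """
    total_volume = sum(irf.values())
    if total_volume == 0:
        return 0.0  # no volume at all, hence nothing to share
    shared = sum(2.0 * volume for volume in crf.values())
    return shared / total_volume


# Extreme cases for a two-stock sample with illustrative unit volumes.
print(cluster_ratio({"i": 1.0, "j": 1.0}, {}))                 # no overlap
print(cluster_ratio({"i": 1.0, "j": 1.0}, {("i", "j"): 1.0}))  # identical spheres
```

For a two-stock sample the two printed values correspond to the lower and upper bounds discussed above.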

Experimental outputs and results

This section contains back-of-the-envelope calculations for two typical cases using synthetic data. The first is a static case study, which discusses what can be explored in a data set at a single point in time. The second is a dynamic study of clustering patterns through time.

Static case study

Consider a financial investor who needs to build a portfolio limited to two distinct stocks amongst only four possible alternatives available in the equity market. Therefore, there are four stocks SA|t,SB|t,SC|t,SD|t and the respective rate of variation from t-1 to t of the following four variables are used as input data in this case study: return on investment (ROI); earnings per share (EPS); debt-to-equity ratio (D/E); and stock price simple net return (in absolute value). Subsequently, the first three variables are placed into the x,y,z-axes, respectively and the fourth variable is reflected in the volume of each of the four spheres IRFA|t,IRFB|t,IRFC|t,IRFD|t, as depicted in Fig. 7.

Fig. 7. Static case study with a sample based on four stocks (i.e. SA|t, SB|t, SC|t, and SD|t) at time t=1.

The red arrows in Fig. 7 are based on which variations of each of the three axial variables – as detailed in the header of Table 1 – would be beneficial or detrimental according to an investor’s point of view. Therefore, considering a typical rational investor, in virtually all scenarios the greater the ROI (i.e. the further to the right towards the x-axis), the better; the greater the EPS (i.e. the further up towards the y-axis), the better; and, conversely, the lower the D/E (i.e. the greater the profundity level towards the z-axis when the azimuthal angle is at 180°), the better. Overall, a rational investor would seek to maximise the return as well as minimise the risk of the investment. Therefore, in terms of return, an investor would prefer to maximise the positive variation of ROI, EPS, and Stock Return while positive (i.e. financial gain); and, conversely, would prefer to maximise the negative variation of D/E and Stock Return while negative (i.e. financial loss). On the other hand, in terms of risk, such a typical investor would prefer the lowest level of variation in all of these four variables due to the fact that the lower the volatility, the higher the level of predictability of such asset, which would result in a lower level of investment risk.

Table 1. Input values of stocks SA|t, SB|t, SC|t, and SD|t at time t=1.

Stock   Δ ROI (xi|1)   Δ EPS (yi|1)   Δ D/E (zi|1)   |Δ Stock Return| (IRFi|1 or Vi|1)
SA|1    0.14           0.17           −0.12          0.08
SB|1    0.09           0.13           −0.07          0.07
SC|1    0.02           0.03           0.04           0.03
SD|1    0.04           0.06           −0.02          0.05

The simulated values of each of the variables used as input in this example are detailed in Table 1. Based only on a visual inspection on Fig. 7 as well as information provided by Table 1, one can draw the following preliminary and elementary conclusions, according to a rational investor’s point of view:

  • Stock SA|1 is better off than stock SB|1, SD|1, or SC|1;

  • Stock SB|1 is better off than stock SD|1 or SC|1;

  • Stock SD|1 is better off than stock SC|1; and

  • IRFA|1>IRFB|1>IRFD|1>IRFC|1.

It is worth mentioning that one of the main goals of the proposed method is to assess, as impartially as possible, competing asset investment alternatives by measuring the similarity level between each pair of assets in a sample. Ultimately, the proposed method does not judge any of its input variables as beneficial or detrimental to the overall performance of an asset portfolio. For instance, a conservative investor A may prefer a lower level of D/E (e.g. a lower probability of the underlying company going bankrupt as a consequence of being unable to repay its debts to creditors), while an investor B, with an aggressive profile, may prefer a higher level of D/E (e.g. expecting a higher profit in the near future due to (i) a higher gearing ratio, meaning more funding for the company's projects, which is expected to translate into higher profits, and (ii) a greater tax shield). Therefore, as the proposed method takes a data-driven approach, such subjective and qualitative judgment must be made only by the decision maker, who would potentially benefit from the unbiased and impartial insights provided by the proposed method.

Subsequently, using data from Table 1 as input data in the proposed method, one can calculate the values of the two intersections involving the whole sample of four stocks, one of them between stocks SA|1 and SB|1 and the other between stocks SC|1 and SD|1. After the calculation of the respective intersection volumes, it is possible to unveil the outputs shown in Tables 2 and 3.

Table 2. Hypothetical common values (i.e. intersecting spherical volumes) between each pair of stocks in the sample.

[Table image: fx2.gif]

Note: The values on the main diagonal refer to intersections between the sphere with itself, which yields the respective individual spherical volume. Such values are not used in the subsequent calculations/analyses.

Table 3. Calculated similarity factors between each pair of stocks in the sample.

[Table image: fx3.gif]

Note: The asterisks (*) refer to interactions of the ratio CRFi,j|t/IRFi|t that are not meaningful, although they can still be calculated mathematically. For instance, CRFA,B|1/IRFC|1, in which SC|1 ∩ SA|1 = ∅ and SC|1 ∩ SB|1 = ∅.

Therefore, a financial investor seeking stock market diversification opportunities can draw insightful conclusions through a visual inspection of Fig. 7 and, most importantly, from the simulated calculations provided in Tables 2 and 3, such as, but certainly not limited to:

  • SA|1 or SB|1 does not interact with stock SC|1 or SD|1, and vice-versa;

  • SA|1 is more similar to SB|1 in comparison with stocks SC|1 or SD|1;

  • SB|1 is more similar to SA|1 in comparison with stocks SC|1 or SD|1;

  • SC|1 is more similar to SD|1 in comparison with stocks SA|1 or SB|1;

  • SD|1 is more similar to SC|1 in comparison with stocks SA|1 or SB|1;

  • There are two data-driven clusters in this sample. Cluster 1 is composed of stocks SA|1 and SB|1, and the members of Cluster 2 are stocks SC|1 and SD|1;

  • Cluster 1 (SA|1 and SB|1) is slightly more homogeneous than Cluster 2 (SC|1 and SD|1), because Cluster 1 has a similarity factor mean of 0.60, compared to 0.53 for Cluster 2. The means of the similarity factors are calculated as follows:

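$$
\text{Cluster 1} = \frac{\mathrm{CRF}_{A,B|1} + \mathrm{CRF}_{B,A|1}}{\mathrm{IRF}_{A|1} + \mathrm{IRF}_{B|1}} = \frac{0.045 + 0.045}{0.08 + 0.07} = 0.600
$$

$$
\text{Cluster 2} = \frac{\mathrm{CRF}_{C,D|1} + \mathrm{CRF}_{D,C|1}}{\mathrm{IRF}_{C|1} + \mathrm{IRF}_{D|1}} = \frac{0.021 + 0.021}{0.03 + 0.05} = 0.525
$$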
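The cluster means above can be reproduced with a few lines of Python. In the sketch below the function names are hypothetical; the shared volumes (0.045 and 0.021) and individual volumes are the ones quoted in the calculation above, and the per-stock ratio follows the CRFi,j|t/IRFi|t definition mentioned in the note to Table 3.

```python
def similarity_factor(crf_ij, irf_i):
    """Share of stock i's volume that is common with stock j."""
    return crf_ij / irf_i


def cluster_similarity(crf_ij, irf_i, irf_j):
    """Mean similarity of a two-stock cluster: (CRF_ij + CRF_ji) / (IRF_i + IRF_j)."""
    return (crf_ij + crf_ij) / (irf_i + irf_j)


cluster_1 = cluster_similarity(0.045, 0.08, 0.07)
cluster_2 = cluster_similarity(0.021, 0.03, 0.05)
print(f"Cluster 1: {cluster_1:.3f}")
print(f"Cluster 2: {cluster_2:.3f}")

# Per-stock ratios for the first cluster, computed from the same inputs.
print(similarity_factor(0.045, 0.08), similarity_factor(0.045, 0.07))
```

The cluster value is a volume-weighted mean of the two per-stock ratios, so it always lies between them.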

Overall, in terms of the pairwise risk-return trade-off, the pair of stocks SA|1 and SB|1 outperforms the pair SC|1 and SD|1 in terms of return. On the other hand, for portfolio composition and diversification purposes, where portfolio risk is decreased by choosing stocks as distinct as possible within a given pair, a blended portfolio with one stock from Cluster 1 (i.e. SA|1 or SB|1) and one stock from Cluster 2 (i.e. SC|1 or SD|1) would have a higher degree of dissimilarity and would be more suitable if the investor aims to build a portfolio as dissimilar as possible (i.e. a higher level of diversification for a given level of return).

Dynamic case study

The rationale behind Rt (detailed in the section Clustering Pattern Indicator) is applied to a sample of two stocks (SA|t and SB|t) on a monthly basis through a whole year (T=12), with each time t corresponding to the last trading day of each of the 12 months of 2008, which includes periods of the Global Financial Crisis of 2007–2008 (GFC of 2007–08 hereinafter). The four variables in this hypothetical dynamic case study are ΔROI, ΔEPS, ΔD/E, and Δ stock price simple net return (in absolute value), all varying from t=1 to 12.

As shown in Table 4, the first value in the time series of Rt (i.e. t=1, reflecting January 2008) is the same value from the previous case study (subsection Static Case Study) for each of the two stocks, which is calculated as follows:

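\[
R_1 = \frac{2\left(\mathrm{IRF}_{A|1} \cap \mathrm{IRF}_{B|1}\right)}{\mathrm{IRF}_{A|1} + \mathrm{IRF}_{B|1}} = \frac{2\,(0.045)}{0.08 + 0.07} = 0.600
\]

Table 4. Input values, risk-similarity measures, and sample statistics related to stocks SA|t and SB|t.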

graphic file with name fx4.gif

Eleven months later, at t=T=12 (i.e. December 2008), Rt on the last date of the sample results in:

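\[
R_{12} = \frac{2\left(\mathrm{IRF}_{A|12} \cap \mathrm{IRF}_{B|12}\right)}{\mathrm{IRF}_{A|12} + \mathrm{IRF}_{B|12}} = \frac{2\,(0.107)}{0.130 + 0.120} = 0.856
\]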

It is worth noting that two data features determine the value of the series Rt at each time t: (i) the level of similarity between the volumes of stocks SA|t and SB|t (which are placed in the denominator of Rt), regardless of the value of their intersecting volume, and (ii) the intersecting volume itself between both stocks (which is placed in the numerator of Rt). The greater (i) and (ii), the closer Rt gets to its maximum value of one (i.e. the extreme case in which both stocks would be identical and, therefore, would have a full non-trivial intersection).
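The bounds of Rt follow directly from its definition, as the short derivation below illustrates. The symbols are an assumed shorthand rather than the paper's own notation: V_{A∩B|t} stands for the intersecting volume in the numerator, and V_{A|t} and V_{B|t} for the two sphere volumes in the denominator.

```latex
% Assumed shorthand: V_{A \cap B|t} = intersecting volume; V_{A|t}, V_{B|t} = sphere volumes.
\begin{align*}
R_t &= \frac{2\,V_{A \cap B|t}}{V_{A|t} + V_{B|t}} \\[4pt]
\text{identical spheres: } & V_{A \cap B|t} = V_{A|t} = V_{B|t} = V
  \;\Rightarrow\; R_t = \frac{2V}{2V} = 1 \\[4pt]
\text{disjoint spheres: } & V_{A \cap B|t} = 0
  \;\Rightarrow\; R_t = 0 \\[4pt]
\text{$A$ fully inside $B$: } & V_{A \cap B|t} = V_{A|t} \le V_{B|t}
  \;\Rightarrow\; R_t = \frac{2\,V_{A|t}}{V_{A|t} + V_{B|t}} \le 1
\end{align*}
```

The third case shows why feature (i) matters on its own: even when one sphere lies entirely inside the other, Rt stays below one unless the two volumes are equal.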

Table 4 contains the input values and, most importantly, the calculated risk-similarity measures of stocks SA|t and SB|t on a monthly basis throughout the sample time span, as well as basic sample statistics, including the first and second central moments of each variable, with emphasis on the Rt variable (rightmost column).

As depicted in Figs. 8 and 9, according to the Rt measure there was an overall increase in the similarity level between stocks SA|t and SB|t throughout 2008. More specifically, there was an increase of 43% from the beginning (R1=0.60) to the end of the year (R12=0.86). This growth in similarity was much steeper from mid-year (R6=0.27) to the end of the year (R12=0.86), resulting in an increase of 221% from June to December 2008.

Fig. 8. Line chart depicting the Rt time series and two major negative events during the GFC of 2007–08, from January 2008 to December 2008.

Fig. 9. Similarity charts of the Rt time series at three distinct points in time: January 2008 (top left), June 2008 (top right), and December 2008 (bottom).

In summary, these outputs indicate that the similarity level between the two stocks increased markedly from the beginning to the end of 2008. One possible explanation, among other potential factors, is the sequence of negative events and media news related to the GFC of 2007–08, during which both stocks became increasingly similar to each other.

Conceptual comparison with well-established methods

Overall, the method proposed in this paper can be understood as a non-hierarchical hard cluster analysis. One of the main motivations for proposing this novel cluster analysis is to address some relevant problems and challenges frequently reported in the cluster analysis literature. More specifically, the following three problems are not treated properly by most traditional clustering methods: (i) setting the number of clusters in a non-arbitrary manner; (ii) allowing for the possibility that a dataset contains no clusters; and (iii) leaving outliers unassigned to any cluster.

The proposed method aims to tackle these three clustering problems at once by adopting a distinct and unusual strategy: imposing an artificial boundary on each sample subject by representing it as a sphere instead of a point in three-dimensional Euclidean space. This artificial boundary autonomously (i.e. through a data-driven process) determines the degree of similarity between each pair of subjects in the sample. Because the proposed method is data-driven, once the input data are included in the model it can potentially provide insightful results to the analyst, such as the following two simple examples: (i) if only one of the sample subjects is very distinct from all of its counterparts, then this particular subject should have no intersection volume with any other subject in the sample, being represented as an isolated sphere in R³ (i.e. an outlier candidate); (ii) if all subjects in the sample are very dissimilar from one another, then one feasible result is the absence of any sphere-sphere intersection, resulting in Rt=0 (i.e. a sample in which no cluster is formed “naturally” by the input data).
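The two examples above can be sketched in Python. This is an illustration under stated assumptions, not the author's implementation: the centres and radii are hypothetical inputs (how they are derived from the financial variables is described elsewhere in the paper), the function names are invented for this sketch, and the overlap volume uses the standard closed-form sphere-sphere intersection formula, as given for example in Weisstein [12].

```python
import math
from itertools import combinations


def sphere_volume(r):
    """Volume of a sphere of radius r."""
    return 4.0 / 3.0 * math.pi * r ** 3


def intersection_volume(c1, r1, c2, r2):
    """Overlap volume of two spheres with centres c1, c2 and radii r1, r2."""
    d = math.dist(c1, c2)  # distance between centres (Python 3.8+)
    if d >= r1 + r2:
        return 0.0  # disjoint, or touching at a single point
    if d <= abs(r1 - r2):
        return sphere_volume(min(r1, r2))  # one sphere lies inside the other
    # Lens-shaped overlap (closed-form sphere-sphere intersection volume).
    return (math.pi * (r1 + r2 - d) ** 2
            * (d * d + 2 * d * (r1 + r2) - 3 * (r1 - r2) ** 2)
            / (12 * d))


def pair_similarity(c1, r1, c2, r2):
    """Twice the overlap volume divided by the sum of the two sphere volumes."""
    overlap = intersection_volume(c1, r1, c2, r2)
    return 2 * overlap / (sphere_volume(r1) + sphere_volume(r2))


def isolated_subjects(spheres):
    """Names of spheres that overlap no other sphere (outlier candidates)."""
    overlapping = set()
    for (name_a, (ca, ra)), (name_b, (cb, rb)) in combinations(spheres.items(), 2):
        if intersection_volume(ca, ra, cb, rb) > 0.0:
            overlapping.update((name_a, name_b))
    return [name for name in spheres if name not in overlapping]


# Hypothetical sample: A and B overlap, C is far away from both.
spheres = {
    "A": ((0.0, 0.0, 0.0), 1.0),
    "B": ((1.0, 0.0, 0.0), 1.0),
    "C": ((5.0, 5.0, 5.0), 0.5),
}

sim_ab = pair_similarity(*spheres["A"], *spheres["B"])
sim_ac = pair_similarity(*spheres["A"], *spheres["C"])
outliers = isolated_subjects(spheres)

print(f"similarity A-B: {sim_ab:.4f}")
print(f"similarity A-C: {sim_ac:.4f}")
print("outlier candidates:", outliers)
```

In this hypothetical sample, A and B share a lens-shaped overlap, while C intersects neither and is therefore reported as an outlier candidate. If every sphere were isolated, all pairwise similarities would be zero, which corresponds to the no-cluster case in example (ii).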

The improvements and output possibilities briefly described above represent an advantage over traditional cluster analyses, such as connectivity models (e.g. HCA) or centroid models (e.g. k-means clustering). Finally, an interesting and more reasonable empirical comparison would be against density models, such as DBSCAN (density-based spatial clustering of applications with noise) and/or OPTICS (ordering points to identify the clustering structure).

Conclusion and future research

By means of a new hard cluster method that is data-driven, more consistent, and less arbitrary, the outputs aim specifically to support financial investors in their portfolio diversification and asset reallocation strategies. The proposed method analyses similarity patterns between pairs of stocks as well as the market clustering structure through time.

The empirical cases based on hypothetical results shown in this paper illustrate how the proposed method can be useful and relevant for financial investment purposes, more specifically for asset reallocation decision making and portfolio diversification strategy. Moreover, the proposed analysis is not limited to the traditional risk-return trade-off analysis, but rather uses a greater number of dimensions based on investor needs and preferences. This type of financial clustering pattern analysis may be useful mainly when equity stocks tend to become more similar (e.g. during crises or periods of high volatility) and it is therefore more difficult to distinguish clearly between investment alternatives using the traditional cluster methods available.

As a suggestion for future research, a natural next stage is the use of real-world variables as input data. In fact, the proposed method is not limited to applications in finance and economics: it can potentially be applied to compare similarity levels between sample subjects in a wide range of fields and research problems, as diverse as biology, medicine, geopolitics, engineering, international trade agreements, optimal M&A partnership, and marketing, among many others. In addition, the development of an algorithm (e.g. MATLAB code/program) to perform all steps of the proposed method and the use of real-world economic variables as input data are on-going projects. The insights to be obtained by applying the proposed method to distinct data sets remain an empirical question.

Acknowledgements

This paper benefited from stimulating conversations with Philip Arestis, Mardi Dungey, Andrew Harvey, Mohammad Hashem Pesaran, Jacob Rasmussen, John Glascock, John Howe, Neil Dodgson, and Alan Thompson, and from the careful reading and insightful comments and suggestions of two anonymous referees. All errors remain exclusively the author’s responsibility. The present research was supported by Coordination for the Improvement of Higher Education Personnel of Brazil (CAPES) and The Cambridge Commonwealth, European & International Trust (under grant BEX 2220/15-6).

Footnotes

1. Both terms (i.e. asset and sphere) will be used interchangeably throughout this paper.

2. Regarding the index notation adopted in this paper, in an attempt to be as neat as possible, the indexes related to each of the sample subjects (e.g. i, j) as well as time (i.e. t) are separated by a vertical bar (i.e. |). Therefore, in the present paper, this symbol does not mean conditional probability, as commonly used in Bayesian statistics.

3. For further information on numerical analysis techniques to compute sphere intersections in three-dimensional spaces, see Strobl et al. [15] and George et al. [16].

References

  • 1. Andersen T.G., Bollerslev T., Diebold F.X., Ebens H. The distribution of realized stock return volatility. J. Financ. Econ. 2001;61(1):43–76.
  • 2. Schwert G.W. Business cycles, financial crises, and stock volatility. Carnegie-Rochester Conference Series on Public Policy. 1989;31:83–125. (September)
  • 3. Reinhart C.M., Rogoff K.S. This Time Is Different: Eight Centuries of Financial Folly. Princeton University Press; 2009.
  • 4. Arestis P., Karakitsos E. Lessons from the ‘Great Recession’ for both theory and economic policy. In: Financial Stability in the Aftermath of the ‘Great Recession’. Palgrave Macmillan; London: 2013. pp. 164–192.
  • 5. Jain A.K. Data clustering: 50 years beyond K-means. Pattern Recognit. Lett. 2010;31(8):651–666.
  • 6. Hennig C., Meila M. Cluster analysis: an overview. In: Handbook of Cluster Analysis. CRC Press, Taylor and Francis Group; 2015. pp. 1–20.
  • 7. Tsay R.S. Analysis of Financial Time Series. vol. 543. John Wiley & Sons; 2005.
  • 8. Kern W.F., Bland J.R. Solid Mensuration: With Proofs. J. Wiley & Sons, Incorporated; 1938.
  • 9. Alvord B. The intersection of circles and the intersection of spheres. Am. J. Math. 1882;5(1):25–44.
  • 10. Court N.A. Four intersecting spheres. Am. Math. Mon. 1960;67(3):241–248.
  • 11. Serezhkin V.N., Mikhailov Y.N., Buslaev Y.A. The method of intersecting spheres for determination of coordination numbers of atoms in crystal structures. Russ. J. Inorg. Chem. 1997;42(12):1871–1910.
  • 12. Weisstein E.W. Sphere-Sphere Intersection. 2007. Available at: http://mathworld.wolfram.com/Sphere-SphereIntersection.html [accessed 2019].
  • 13. Borowski E.J., Borwein J.M. The HarperCollins Dictionary of Mathematics. HarperCollins; 1991.
  • 14. Alt H., Fuchs U., Rote G., Weber G. Matching convex shapes with respect to the symmetric difference. Algorithmica. 1998;21(1):89–103.
  • 15. Strobl S., Formella A., Pöschel T. Exact calculation of the overlap volume of spheres and mesh elements. J. Comput. Phys. 2016;311:158–172.
  • 16. George P.L., Hecht F., Saltel É. Automatic mesh generator with specified boundary. Comput. Methods Appl. Mech. Eng. 1991;92(3):269–288.

Articles from MethodsX are provided here courtesy of Elsevier
