PLOS One. 2021 Jan 14;16(1):e0245092. doi: 10.1371/journal.pone.0245092

Covariance matrix filtering with bootstrapped hierarchies

Christian Bongiorno 1,*,#, Damien Challet 1,#
Editor: Roberta Sinatra 2
PMCID: PMC7808632  PMID: 33444350

Abstract

Cleaning covariance matrices is a highly non-trivial problem, yet of central importance in the statistical inference of dependence between objects. We propose here a probabilistic hierarchical clustering method, named Bootstrapped Average Hierarchical Clustering (BAHC), that is particularly effective in the high-dimensional case, i.e., when there are more objects than features. When applied to DNA microarray, our method yields distinct hierarchical structures that cannot be accounted for by usual hierarchical clustering. We then use global minimum-variance risk management to test our method and find that BAHC leads to significantly smaller realized risk compared to state-of-the-art linear and nonlinear filtering methods in the high-dimensional case. Spectral decomposition shows that BAHC better captures the persistence of the dependence structure between asset price returns in the calibration and the test periods.

Introduction

Covariance matrix estimation is a cornerstone of dependence inference between objects. Unfortunately, this kind of matrix becomes very noisy when the number of objects is similar to the number of features, a phenomenon known as the curse of dimensionality. Even worse, unfiltered covariance matrices are pathological when the number of objects exceeds the number of features, i.e., in the so-called high-dimensional case. This case is frequent, e.g., in biological data and in multivariate dynamical systems such as financial markets, in which only the most recent history is likely to be relevant.

Given its importance, covariance matrix filtering has a long history. A popular approach is to obtain a filtered covariance matrix from the corresponding correlation matrix. Two types of approaches stand out: i) spectral methods, e.g. Random Matrix Theory, Rotationally Invariant Estimators [1], and Shrinkage [2, 3]; ii) ansatz for the correlation matrix, e.g. block-diagonal [4] or hierarchical [5].

The usual setting is to have n objects and t features and to compute the correlation matrix between these n objects. Recent results on Rotationally Invariant Estimators [6] propose algorithms able to correct the eigenvalue spectrum of covariance matrices optimally without filtering their eigenvectors: the inversion of the QuEST function [7], the Cross-Validated (CV) eigenvalue shrinkage [8] and the IW-regularization [1], the latter being valid only in the low-dimensional regime q = n/t < 1, i.e., when there are more features than objects. Direct eigenvector filtering is more complex. An indirect way to filter both eigenvectors and eigenvalues is to use ansätze for the shape of the true correlation matrix, which also impose constraints on the structure of both the eigenvectors and the eigenvalues. A good ansatz should be simple enough to clean noise but flexible enough to account for fine relevant details. The popular hierarchical clustering ansatz (HC hereafter) is indeed simple: it assumes that correlations are nested [5, 9], which is equivalent to assuming that dependencies are described by a dendrogram (a tree).

An obvious problem of HC occurs when the structure is more complex than a tree: for example, the non-diagonal blocks in Figs 1 and 2 are erased by a hierarchical ansatz. As a consequence, a non-negligible part of the dependence structure is left out. In these cases, the tree inferred by a hierarchical ansatz is fragile with respect to small data perturbations such as bootstraps. The fragility itself was noted for example in Ref. [10] which showed that only a subset of links of a minimum spanning tree associated to a HC are reliable when data are perturbed by bootstraps. In practice, it is hard to find statistically-validated hierarchical structures [11] when the fitted hierarchical structure is highly sensitive to small variations of data.

Fig 1. Correlation matrix from tissue-gene micro-array data of patients affected by lung cancer.


The upper left plot is the sample correlation matrix; the upper right plot is the result of hierarchical clustering average linkage (HCAL) filtering. The bottom left plot is the difference between the two: it still shows evident structure unaccounted for by HCAL.

Fig 2. Correlation matrix of US equities price returns in the 2008-01-23 to 2008-11-04 (left plot) and in the 2008-11-05 to 2009-08-24 period (right plot).


The elements of both panels are ordered according to the in-sample HCAL dendrogram of the first period.

Here, we introduce a more flexible method able to capture more of the structure of the eigenvectors. The idea is to create many bootstrapped copies of the original data and to apply hierarchical clustering average linkage (HCAL) [5] filtering to each of them. We then average all these HCAL-filtered matrices. We call our method BAHC, which stands for Bootstrapped Average Hierarchical Clustering, and define it for covariance and correlation matrices. A BAHC-filtered matrix is a sum of multiple hierarchical structures weighted by their frequency. A single hierarchical structure will only emerge if all the bootstrap realizations lead to the same dendrogram. Thus, this method is particularly adapted to data that is well-described by a hierarchical structure in a first approximation [12] but avoids selecting a single fragile structure.

We illustrate the power of our method with data from two relevant fields. First, in bioinformatics, DNA micro-array gene expression dependence in tissues is frequently characterized by correlation matrices. Hierarchical clustering and its variants are commonly used [13, 14], which helps simplify the covariance matrix by linkage averaging [15] (see Fig 1). When there are several different candidate hierarchical structures, this approach selects only a single one, which neglects possibly crucial information held by alternative structures. Comparing unfiltered correlation matrices with the filtering yielded by hierarchical clustering and average linkage (HCAL) [5] (Fig 1) makes it clear that (i) hierarchical clustering does capture some of the structure and (ii) a substantial part of the structure is lost (see the bottom plot). This is because hierarchical clustering imposes too strict a structure, which erases an uncontrolled amount of information.

Another domain in which covariance matrix filtering plays a central role is risk management in many areas. Broadly speaking, the problem amounts to minimizing future uncertainty by determining the fraction of resources to allocate to every possible choice. Risk in this particular context is due to fluctuations of the future value of the choices. The usual procedure consists in minimizing a suitable risk measure in the calibration window and hoping that the future, realized risk will bear some relationship with the calibrated risk.

The simplest approach consists in defining risk as the variance of the weighted sum of choices’ values and minimising it. This is known as global minimum-variance portfolio optimization, a subfield of quadratic portfolio optimization with a wide range of applications: investment into technologies [16], energy sources mix for countries [17, 18], wind farm locations [19], and capital allocation in finance [20]. We shall focus on financial risk because data are abundant, which makes it possible to compare the out-of-sample performance of filtering methods. In addition, the high-dimensional regime is particularly relevant in finance: there are many assets to choose from, and the speed with which the dependence structure between asset price returns may change calls for a calibration period that is as short as possible [21].

In an inference or descriptive context such as DNA microarray data analysis, filtering correlation matrices is meant to bring estimated covariance matrices closer to the ground truth. In a dynamical context, especially for non-stationary systems such as financial markets, what matters is the part of the ground truth that most likely persists after the calibration period, i.e., when one uses the allocation weights computed from the filtered covariance matrix. Thus, ideally, the filtered covariance matrix should contain as much of the persistent structure as possible. The nature of the most likely persistent structure is of course unknown from the calibration window only. Fig 2 shows that there are indeed strongly persistent dependence structures of asset price returns between two non-overlapping periods. Similarly to correlation matrices of DNA microarray data, while a pure HC does capture a sizeable part of the useful structure, the off-diagonal correlation blocks, e.g., around (x, y) = (140, 600), indicate that HC by itself is not sufficient.

Methods

Datasets description

We consider the daily close-to-close returns from 1992-02-03 to 2020-03-31 of US equities, adjusted for dividends, splits, and other corporate events. More precisely, the dataset consists of 1295 assets taken from the union of all the components of the Russell 1000 from 2010-06 to 2020-03. The number of stocks with data varies over time: it ranges from 151 on 1992-06-22 to 1172 on 2018-01-17 (see S1 File for the code used to download the data).

DNA microarray data [22] can be downloaded from [23]. It consists of gene expression intensity of 327 tissues of patients affected by pediatric acute lymphoblastic leukemia and a subset of 271 genes.

Numerical simulations with financial data

All the simulations are carried out in the same way: each point of each plot is an average over 10,000 simulations, each of which includes an in-sample window of length $t_{\mathrm{in}}$ and an out-of-sample window of length $t_{\mathrm{out}} = 42$ days (about two trading months) unless otherwise specified; each simulation starts from a random day chosen uniformly in the available dataset. To have meaningful in- and out-of-sample windows given the maximum $t_{\mathrm{in}}$ considered, the first day of the out-of-sample window must be after 2000-01-01; each simulation selects n = 100 assets at random among the assets with no missing value in both the in- and out-of-sample windows.
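The sampling step described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `draw_simulation`, the convention that missing values are stored as NaN, and the `first_out` index (standing in for the date constraint on the out-of-sample window) are hypothetical and not taken from the supplementary code. The sketch draws the first out-of-sample day uniformly among the admissible days.

```python
import numpy as np


def draw_simulation(returns, t_in, t_out=42, n=100, first_out=0, rng=None):
    """Draw one (in-sample, out-of-sample) pair of return matrices.

    returns   : array of shape (assets, days); NaN marks a missing value
    first_out : earliest admissible column index of the out-of-sample window
    """
    rng = np.random.default_rng() if rng is None else rng
    n_days = returns.shape[1]
    lo = max(first_out, t_in)      # the in-sample window must fit before it
    hi = n_days - t_out            # the out-of-sample window must fit after it
    start_out = rng.integers(lo, hi + 1)
    window = returns[:, start_out - t_in:start_out + t_out]
    # Keep only assets with no missing value in the whole window.
    complete = np.flatnonzero(~np.isnan(window).any(axis=1))
    if complete.size < n:
        raise ValueError("not enough assets without missing values")
    chosen = rng.choice(complete, size=n, replace=False)
    return window[chosen, :t_in], window[chosen, t_in:]
```

Repeating this draw many times and averaging the quantity of interest gives one point of a plot.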

BAHC algorithm

Given a data matrix $R \in \mathbb{R}^{n \times t}$, our method prescribes creating a set of m (feature-wise) bootstrap copies of R, denoted by $\{R^{(1)}, R^{(2)}, \cdots, R^{(m)}\}$. A single bootstrap copy of the data matrix, $R^{(b)} \in \mathbb{R}^{n \times t}$, has elements $r^{(b)}_{ij} = r_{i s^{(b)}_j}$, where $s^{(b)}$ is a vector of dimension t obtained by random sampling with replacement of the elements of the vector {1, 2, ⋯, t}. The vectors $s^{(b)}$, b = 1, ⋯, m, are independently sampled.

The Pearson correlation matrix of each bootstrapped data matrix $R^{(b)}$ is then computed and denoted by $C^{(b)}$; in turn, the latter is filtered with the hierarchical clustering average linkage (HCAL) proposed in [5], which yields $C^{(b)<}$. In short, HCAL uses two ingredients: the distance D = 1 − C to agglomerate clusters in a hierarchical way, and the averaging of the correlations between clusters (see S1 Appendix for more details).

Finally, the filtered correlation matrix $C^{\mathrm{BAHC}}$ is the average of the HCAL-filtered matrices $C^{(b)<}$:

$$C^{\mathrm{BAHC}} = \frac{1}{m} \sum_{b=1}^{m} C^{(b)<}.$$

To build a BAHC-filtered covariance matrix, we estimate the variance of $r_i$, denoted by $\sigma_{ii}$, and obtain the elements of the BAHC-filtered covariance matrix as

$$\sigma^{\mathrm{BAHC}}_{ij} = c^{\mathrm{BAHC}}_{ij} \sqrt{\sigma_{ii}\,\sigma_{jj}}.$$
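The three steps of this section (feature-wise bootstrap, HCAL filtering, averaging) can be illustrated with a minimal NumPy sketch. It is not the code of the BAHC package: the function names are hypothetical, and the HCAL step is a naive reading of the two ingredients stated above, in which the two clusters with the largest average correlation (the smallest average distance D = 1 − C) are merged and every pair of objects first joined at that merge receives the average correlation between the two clusters. The cost of this naive loop grows quickly with n, so it is meant for small examples only.

```python
import numpy as np


def hcal_filter(C):
    """Average-linkage filter: pairs first joined at a node share that node's
    mean inter-cluster correlation (equivalent to merging on D = 1 - C)."""
    n = C.shape[0]
    clusters = [[i] for i in range(n)]
    C_f = np.eye(n)
    while len(clusters) > 1:
        best, pair = -np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                avg = C[np.ix_(clusters[a], clusters[b])].mean()
                if avg > best:
                    best, pair = avg, (a, b)
        a, b = pair
        C_f[np.ix_(clusters[a], clusters[b])] = best
        C_f[np.ix_(clusters[b], clusters[a])] = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return C_f


def bahc_correlation(R, m=100, seed=0):
    """Average of the HCAL-filtered correlation matrices of m bootstrap copies."""
    rng = np.random.default_rng(seed)
    n, t = R.shape
    C_sum = np.zeros((n, n))
    for _ in range(m):
        s = rng.integers(0, t, size=t)   # feature indices, drawn with replacement
        C_sum += hcal_filter(np.corrcoef(R[:, s]))
    return C_sum / m


def bahc_covariance(R, m=100, seed=0):
    """Filtered correlation times the product of the two standard deviations."""
    sd = R.std(axis=1, ddof=1)
    return bahc_correlation(R, m, seed) * np.outer(sd, sd)
```

For example, `bahc_correlation(R, m=100)` applied to an n × t array `R` returns an n × n symmetric matrix with a unit diagonal.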

Source code

We have written a BAHC package for both R and Python, available from CRAN and PyPI, respectively.

Frobenius norms

We use rescaled Frobenius norms to account for the fact that the number of assets in our dataset depends on time, defined as

$$\|X\|_F^{\Sigma} = \sqrt{\frac{\sum_{i,j}^{n \times n} x_{ij}^2}{n^2}}. \tag{1}$$

In addition, because the CV, LW and QuEST methods do not guarantee a unit diagonal in the filtered correlation matrices (contrary to BAHC), we exclude the diagonal elements from the metric and thus define

$$\|X\|_F^{C} = \sqrt{\frac{\sum_{i>j}^{n \times n} 2\,x_{ij}^2}{n(n-1)}}. \tag{2}$$

We found that the performance of the CV-, LW- and QuEST-based correlation estimators is slightly improved by replacing $c_{ij}$ with $c_{ij}/\sqrt{c_{ii}\,c_{jj}}$, which also ensures that the diagonal elements equal one, and have thus used this modification in our analysis.
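The two rescaled norms and the diagonal rescaling can be written in a few lines of NumPy. The sketch below assumes the usual square root of a Frobenius norm in Eqs (1) and (2); the function names are illustrative only.

```python
import numpy as np


def frobenius_cov(X):
    """Rescaled Frobenius norm over all n^2 entries, Eq (1)."""
    n = X.shape[0]
    return np.sqrt(np.sum(X ** 2) / n ** 2)


def frobenius_corr(X):
    """Rescaled Frobenius norm over off-diagonal entries only, Eq (2)."""
    n = X.shape[0]
    lower = X[np.tril_indices(n, k=-1)]      # entries with i > j
    return np.sqrt(2.0 * np.sum(lower ** 2) / (n * (n - 1)))


def unit_diagonal(C):
    """Replace c_ij by c_ij / sqrt(c_ii * c_jj) so that the diagonal equals one."""
    d = np.sqrt(np.diag(C))
    return C / np.outer(d, d)
```

Both norms are applied to a difference of matrices, e.g. `frobenius_corr(C_out - C_filtered)`, so that values remain comparable when the number of assets changes.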

Results

Microarray DNA

We first apply the BAHC method to DNA microarray data [22] where the objects are n = 327 tissues of patients affected by pediatric acute lymphoblastic leukemia and features are the expression intensities of t = 271 genes (q ≃ 1.21). Classifying leukemia subtypes based on their gene expression profile is crucial to correct prognosis and risk assessment. However, the simplistic classification obtained from a single tree could lose relevant information coming from more complex dependence structures.

To show the new insights brought by BAHC compared to a simple hierarchical clustering, we kept the dendrograms of all the bootstraps used to compute the BAHC-filtered correlation matrix and produced a bidimensional t-SNE projection [24] using the pairwise cophenetic correlation coefficient as a distance. In this map, each point corresponds to a bootstrapped copy of the original data. Two such copies are represented nearby if the cophenetic correlation between their HC-filtered dendrograms is high, i.e., if the dendrograms are similar. If two randomly chosen bootstrap dendrograms differed only because of sample size error, we would expect the points of such a bidimensional map to be scattered around an average dendrogram. However, two main clusters of dendrograms appear. They essentially differ by the topmost branches, as shown by the tanglegram of the centroids of these two clusters (right plot of Fig 3). This means that in this dataset, a small perturbation affects not only the lower levels of the dendrograms, whose composition depends on the stability of single correlation coefficients or pairs of them, which are necessarily highly affected by sample size error, but also the highest aggregate levels, which should be more robust to sample size noise. In other words, the appearance of two clear clusters of dendrograms shows that a single dendrogram fails to account for the real dependence between gene expression intensities. In addition, clades that are distant on the sample dendrogram may be much closer in both of these clusters.
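As an illustration of the comparison between bootstrap dendrograms, the sketch below computes a pairwise cophenetic correlation matrix. It assumes that each dendrogram is summarized by its HCAL-filtered correlation matrix, whose off-diagonal entries are taken here as one minus the cophenetic distance, and that the cophenetic correlation between two dendrograms is the Pearson correlation between their cophenetic distances. The function name is hypothetical and the projection step itself is not reproduced.

```python
import numpy as np


def cophenetic_correlation_matrix(filtered):
    """Pairwise Pearson correlation between the cophenetic distances of
    several dendrograms, each given as a filtered correlation matrix."""
    n = filtered[0].shape[0]
    iu = np.triu_indices(n, k=1)             # one entry per pair of objects
    dists = np.array([1.0 - F[iu] for F in filtered])
    return np.corrcoef(dists)
```

A dissimilarity such as one minus this matrix could then be passed to any t-SNE implementation that accepts precomputed dissimilarities.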

Fig 3. Bidimensional t-SNE projection of the cophenetic distance between the dendrograms of 1000 bootstraps of DNA microarray data [22].


Two main clusters emerge, with further subclusters, corresponding to distinct potential hierarchies of dependence that are compatible with the data. The red crosses indicate the centroids of the two largest clusters, whose structural differences appear in the tanglegram of the right plot.

This shows that even a large distance between two sub-groups of elements (cancers, in this case) may not be stable with respect to small perturbation of the data. Thus, if one wishes to cluster genes, one should generate bootstrapped dendrograms and then apply a clustering method adapted to trees, as we did above. If one needs a filtered covariance matrix, one should use BAHC instead of a HC.

Risk minimization

Given the n × (t + 1) matrix of values of choice i at time k, $p_{i,k}$, and the value returns $r_{i,k} = p_{i,k}/p_{i,k-1} - 1$, one must determine the fraction of investment given to each choice i, the i-th component of the vector w. The risk is measured by the standard deviation of the portfolio return, denoted by $v_P$, with $v_P^2 = w^{\top} \Sigma w$, where Σ is the n × n covariance matrix of the matrix of returns R. If the weights can be negative, the optimal weights are $\tilde{w} = \frac{\Sigma^{-1} \mathbf{1}}{\mathbf{1}^{\top} \Sigma^{-1} \mathbf{1}}$, with the condition $\sum_i w_i = 1$ imposed in order to avoid the trivial solution w = 0. This situation is called long-short portfolio in the following. In some situations, e.g., when choosing one’s portfolio of energies or products, only positive weights are allowed, in which case one has to solve a quadratic programming problem; we refer to this situation as long-only portfolio.

The realized (out-of-sample) risk is the relevant performance measure. Using the superscript out, the realized risk is

$$v_P^{\mathrm{out}} = \sqrt{\tilde{w}^{\top} \Sigma^{\mathrm{out}} \tilde{w}},$$

where $\tilde{w}$ is computed from the in-sample covariance matrix, filtered or not, and $X^{\top}$ is the transpose of matrix X.
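For the long-short case, the closed-form weights and the realized risk can be computed as in the following sketch. The function names are illustrative, the sketch assumes that the in-sample covariance matrix passed to it is invertible, and the long-only case, which requires a quadratic programming solver, is not covered.

```python
import numpy as np


def gmv_weights(cov):
    """Long-short global minimum-variance weights, normalized to sum to one."""
    ones = np.ones(cov.shape[0])
    x = np.linalg.solve(cov, ones)   # Sigma^{-1} 1 without forming the inverse
    return x / (ones @ x)


def realized_risk(w, cov_out):
    """Standard deviation of the portfolio under the out-of-sample covariance."""
    return np.sqrt(w @ cov_out @ w)
```

With two uncorrelated choices of variances 1 and 4, `gmv_weights(np.diag([1.0, 4.0]))` allocates 0.8 to the first and 0.2 to the second, i.e., weights inversely proportional to the variances.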

All the results reported below use the simulation setup described in the Methods section: in short, we perform 10,000 simulations of n = 100 random assets in random periods. We compare the out-of-sample risk computed from BAHC and several other well-known methods: the classic Ledoit and Wolf linear shrinkage method (LW henceforth) [2] and the more recent nonlinear shrinkage approach based on the inversion of the QuEST function (QuEST) [7]. We also include the Cross-Validated eigenvalue shrinkage (CV) [8] and HCAL [5], denoted by <.

Fig 4 shows that BAHC outperforms all the alternative methods for $t_{\mathrm{in}} \lesssim 200$, i.e., for $q = n/t \gtrsim 1/2$, which includes all of the high-dimensional regime q > 1. In particular, for the long-only portfolios, the BAHC method reaches the absolute minimum out-of-sample risk over all $t_{\mathrm{in}}$ and all methods for $t_{\mathrm{in}} \simeq 200$, i.e., q ≃ 1/2. The right-hand-side plots of Fig 4 report the probability that BAHC outperforms each alternative method when q > 1/2, which confirms that BAHC is better than all the other methods not only with respect to the average realized risk, but also in probability in this region.

Fig 4. Left plots: Realized risk for different estimators; right plots: Fraction of time the realized risk of BAHC is smaller than the one obtained with alternative estimators.


10,000 independent simulations per point; $t_{\mathrm{out}} = 42$ days, n = 100 assets, US equities.

Finally, we vary the length of the test window, $t_{\mathrm{out}}$. We report the probability that the BAHC method outperforms all its competitors as a function of both $t_{\mathrm{in}}$ and $t_{\mathrm{out}}$ in Fig 5. Our approach achieves a lower realized risk in more than half of the simulations than any other method tested here as soon as $t_{\mathrm{in}} < 177$ (q > 1/1.77) for every $t_{\mathrm{out}}$ in the considered range. Remarkably, as $t_{\mathrm{out}}$ increases, the calibration length below which BAHC has a better than 50% chance of outperforming all its competitors increases only weakly. We attribute this result to the ability of our method to extract the right kind of persistent structure in that particular data, which is confirmed below by spectral analysis. We found similar results for the Hong Kong equity market (see S1 Appendix). We also report in the S1 Appendix an alternative analysis in which the out-of-sample standard deviations are used to compute the portfolio compositions. This analysis aims to isolate the effect of correlation filtering, providing a lower bound for risk minimization. However, we did not observe any qualitative differences.

Fig 5. Fraction of time BAHC yields a smaller realized risk than all the alternative methods.


Left plot: portfolios with positive and negative weights; right plot: portfolios with only positive weights. The dotted line corresponds to q = n/t = 1, and the level curve to a 50% probability. 10,000 independent simulations per point; $t_{\mathrm{out}} = 42$ days, n = 100 assets, US equities.

Spectral properties

In order to understand why and when our method performs better than the other methods based on spectral filtering, it is instructive to compare the in- and out-of-sample persistence of the eigenvalues and eigenvectors produced by all the filtering methods considered here. The spectral decomposition of the correlation matrix C is denoted by $C = U \Lambda U^{\top}$, where U is an n × n matrix formed by the eigenvectors of C and Λ is the diagonal matrix of the corresponding eigenvalues.

Eigenvectors stability

A simple way to characterise the overall eigenvector stability is to compare the empirical out-of-sample correlation matrix $C^{\mathrm{out}}$ with the Oracle correlation estimator defined as $\Xi(C^{\mathrm{in}}) = U^{\mathrm{in}} Z^{\mathrm{in}} (U^{\mathrm{in}})^{\top}$, where $Z^{\mathrm{in}} = \mathrm{diag}\big((U^{\mathrm{in}})^{\top} C^{\mathrm{out}} U^{\mathrm{in}}\big)$ is the Oracle eigenvalue estimator, the idea being that $\Xi(C^{\mathrm{in}}) = C^{\mathrm{out}}$ if the in- and out-of-sample eigenvectors coincide (see S1 Appendix). The Oracle estimator for the covariance matrix, denoted by $\Xi(\Sigma^{\mathrm{in}})$, is defined in a similar way.
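A minimal sketch of the Oracle estimator follows, assuming symmetric input matrices and using the eigenvectors returned by `numpy.linalg.eigh`; the function name is illustrative. The in-sample matrix may be the raw sample estimate or any filtered version of it.

```python
import numpy as np


def oracle_estimator(M_in, M_out):
    """Combine in-sample eigenvectors with the out-of-sample matrix.

    Returns the Oracle matrix U diag(z) U^T and the Oracle eigenvalues z,
    where z_i is the i-th diagonal entry of U^T M_out U.
    """
    _, U = np.linalg.eigh(M_in)          # columns of U: in-sample eigenvectors
    z = np.diag(U.T @ M_out @ U)
    return (U * z) @ U.T, z              # U * z scales column i of U by z_i
```

When the in- and out-of-sample matrices share their eigenvectors, the returned matrix equals the out-of-sample matrix; otherwise only its trace is guaranteed to match.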

Fig 6 reports the Frobenius distances (see the Methods section) $\|C^{\mathrm{out}} - \Xi(C^{\mathrm{in}})\|_F^{C}$ and $\|\Sigma^{\mathrm{out}} - \Xi(\Sigma^{\mathrm{in}})\|_F^{\Sigma}$ as a function of $t_{\mathrm{in}}$ for n = 100 assets. Note that the CV, LW and QuEST methods all use the in-sample eigenvectors and thus we do not need to report separate results. Generally, our method yields more stable correlation and covariance matrices not only in the high-dimensional case, but also down to q ≃ 1/3, i.e., $t_{\mathrm{in}} < 300$. The difference is due to the fact that the eigenvectors obtained by our method are more stable than the vanilla in-sample eigenvectors, which mechanically improves the Oracle estimator.

Fig 6. Frobenius distance between the out-of-sample matrices and the Oracle estimators obtained with the in-sample eigenvectors (in), the in-sample BAHC-filtered eigenvectors (BAHC) and the in-sample HCAL-filtered eigenvectors (<).


Upper panels refer to correlation matrices C, lower panels to covariance matrices Σ. The left panels are the Frobenius norm of the difference between the estimator and the out-of-sample realization; the right panels are the fraction of time BAHC outperforms the alternative estimators. 10,000 independent simulations per point; $t_{\mathrm{out}} = 42$ days, n = 100 assets, US equities.

Fig 6 also shows that the probability that the eigenvectors of BAHC-filtered correlation matrices are more stable than those provided by the alternative filtering methods grows as $t_{\mathrm{in}}$ becomes smaller. The same applies to the comparison between BAHC-filtered and empirical covariance matrices, while HCAL, denoted by <, has better performance in about 20% of the samples, almost independently of $t_{\mathrm{in}}$. In short, as soon as q > 1/3 in this dataset, the BAHC method likely yields more persistent eigenvectors than all the other filtering methods considered here.

Eigenvalues stability

Since both the covariance Σ and precision Σ−1 matrices are relevant to minimum-variance optimization, we measure two types of residues that focus on large and small eigenvalues, defined as

$$\epsilon_{\mathrm{hi}} = \frac{1}{n} \sum_{i=1}^{n} (\lambda_i - z_i)^2, \tag{3}$$

$$\epsilon_{\mathrm{low}} = \frac{1}{n} \sum_{i=1}^{n} \left(\frac{1}{\lambda_i} - \frac{1}{z_i}\right)^2, \tag{4}$$

where $\lambda_i = (\Lambda)_{ii}$ is the i-th (ranked) eigenvalue of the in-sample estimator and $z_i = (Z^{\mathrm{in}})_{ii}$ is the eigenvalue of the same rank from the Oracle estimator computed with the respective filtered eigenvector matrix. The residue measure $\epsilon_{\mathrm{hi}}$ mainly accounts for the discrepancy between the largest eigenvalues, and the residue measure $\epsilon_{\mathrm{low}}$ attributes more weight to the discrepancy between the smallest eigenvalues.
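The two residues can be sketched as below, reading Eqs (3) and (4) literally as mean squared differences between eigenvalues of the same rank. The function name is illustrative, and the sketch assumes strictly positive eigenvalues, which does not hold for an unfiltered sample estimator in the high-dimensional case.

```python
import numpy as np


def residues(lam, z):
    """Residues between in-sample eigenvalues lam and Oracle eigenvalues z,
    both ranked in decreasing order before being compared."""
    lam = np.sort(np.asarray(lam, dtype=float))[::-1]
    z = np.sort(np.asarray(z, dtype=float))[::-1]
    eps_hi = np.mean((lam - z) ** 2)                 # dominated by large eigenvalues
    eps_low = np.mean((1.0 / lam - 1.0 / z) ** 2)    # dominated by small eigenvalues
    return eps_hi, eps_low
```

For instance, with in-sample eigenvalues (2, 1) and Oracle eigenvalues (1, 1), the two residues are 0.5 and 0.125.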

Fig 7 plots the residues of the correlation and covariance matrices, respectively, as a function of $t_{\mathrm{in}}$. We compare our approach with the sample estimator, the HCAL-filtered matrix, and the Cross-Validated (CV) eigenvalue distribution. While the CV method outperforms all the other methods when $t_{\mathrm{in}} \lesssim 1000$ (q > 0.1), the eigenvalues produced by our method are still much closer to the Oracle than those of the raw sample estimator when $t_{\mathrm{in}} \lesssim 500$.

Fig 7. Average residues $\epsilon_{\mathrm{hi}}$ and $\epsilon_{\mathrm{low}}$ over 10,000 simulations with random calibration windows and a random selection of n = 100 assets.


The upper panel refers to the correlation matrix, the lower panel refers to the covariance matrix. 10,000 independent simulations per point; $t_{\mathrm{out}} = 42$ days, n = 100 assets, US equities.

Filtered correlation and covariance matrices

The ultimate test is of course to compare filtered in-sample matrices with out-of-sample matrices. Fig 8 reports the Frobenius distance between the filtered in-sample and out-of-sample correlation and covariance matrices for all the tested methods. As expected, BAHC outperforms all the other methods for $t_{\mathrm{in}} \lesssim 300$. Fig 8 also plots the fraction of times the Frobenius norm of our method is lower than that of the other methods, which confirms the superiority of BAHC for q ≳ 1/3 and also shows that the BAHC method outperforms HCAL filtering for every $t_{\mathrm{in}}$. Once again, this emphasizes that a strict hierarchical structure is not sufficient to capture the stable structure of eigenvectors fully.

Fig 8.


Left plots: Frobenius distance between out-of-sample matrices and filtered in-sample matrices; upper panels refer to correlation matrices C, lower panels to covariance matrices Σ. Right plots: Fraction of time the Frobenius distance of BAHC-filtered matrices is smaller than that of the alternative estimators. 10,000 independent simulations per point; $t_{\mathrm{out}} = 42$ days, n = 100 assets, US equities.

Conclusions

Filtering covariance and correlation matrices requires taking care of O(n²) coefficients. Focusing on O(n) variables, for example by tweaking the eigenvalues or using a single hierarchical ansatz, works to some extent. Making further progress requires filtering more variables, if possible while keeping an O(n) ansatz. This is what the BAHC method achieves: by using m bootstraps and applying an O(n) structure, BAHC allows some additional flexibility while keeping the overall structure simple.

Our method both filters out estimation noise and improves the stability of the eigenvectors in a dynamical context. Indeed, the spectral decomposition of BAHC-filtered correlation matrices is close to the optimal CV method with respect to the eigenvalue distribution. Furthermore, in the dynamical context investigated here, the eigenvectors produced by our method have a higher overlap with the out-of-sample ones than the unfiltered in-sample eigenvectors for reasonably small q = n/t. This is why our method leads to better minimum-variance portfolios than all the competing filtering methods when the calibration window is small. In particular, if no short selling is allowed, our approach produces, on average, the lowest-risk portfolio.

Future work is needed to better characterize the average dependence structure produced by BAHC, from both theoretical and empirical points of view. In addition, BAHC may still be too strict in some cases and thus leave out valuable information; hence, further refinements of the ansatz will need to be investigated.

Supporting information

S1 Appendix

(PDF)

S1 File. Financial dataset code.

(ZIP)

Acknowledgments

This publication stems from a partnership between CentraleSupélec and BNP Paribas. This work was performed using HPC resources from the “Mésocentre” computing center of CentraleSupélec and École Normale Supérieure Paris-Saclay supported by CNRS and Région Île-de-France.

Data Availability

Financial Data cannot be shared publicly although they are publicly available online. We included in the electronic supplementary material the code that we used to download the data from Yahoo Finance.

Funding Statement

The author(s) received no specific funding for this work.

References

  • 1. Bun J, Bouchaud JP, Potters M. Cleaning large correlation matrices: tools from random matrix theory. Physics Reports. 2017;666:1–109. doi: 10.1016/j.physrep.2016.10.005
  • 2. Ledoit O, Wolf M. A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis. 2004;88(2):365–411. doi: 10.1016/S0047-259X(03)00096-4
  • 3. Ledoit O, Wolf M. Nonlinear shrinkage of the covariance matrix for portfolio selection: Markowitz meets Goldilocks. The Review of Financial Studies. 2017;30(12):4349–4388. doi: 10.1093/rfs/hhx052
  • 4. Begušić S, Kostanjčar Z. Cluster-Based Shrinkage of Correlation Matrices for Portfolio Optimization. In: 2019 11th International Symposium on Image and Signal Processing and Analysis (ISPA). IEEE; 2019. p. 301–305.
  • 5. Tumminello M, Lillo F, Mantegna RN. Hierarchically nested factor model from multivariate data. EPL (Europhysics Letters). 2007;78(3):30006. doi: 10.1209/0295-5075/78/30006
  • 6. Bun J, Allez R, Bouchaud JP, Potters M. Rotational invariant estimator for general noisy matrices. IEEE Transactions on Information Theory. 2016;62(12):7475–7490. doi: 10.1109/TIT.2016.2616132
  • 7. Ledoit O, Wolf M, et al. Nonlinear shrinkage estimation of large-dimensional covariance matrices. The Annals of Statistics. 2012;40(2):1024–1060. doi: 10.1214/12-AOS989
  • 8. Bartz D. Cross-validation based Nonlinear Shrinkage; 2016.
  • 9. Pantaleo E, Tumminello M, Lillo F, Mantegna RN. When do improved covariance matrix estimators enhance portfolio optimization? An empirical comparative study of nine estimators. Quantitative Finance. 2011;11(7):1067–1080. doi: 10.1080/14697688.2010.534813
  • 10. Tumminello M, Coronnello C, Lillo F, Micciche S, Mantegna RN. Spanning trees and bootstrap reliability estimation in correlation-based networks. International Journal of Bifurcation and Chaos. 2007;17(07):2319–2329. doi: 10.1142/S0218127407018415
  • 11. Bongiorno C, Miccichè S, Mantegna RN. Nested partitions from hierarchical clustering statistical validation; 2019.
  • 12. Mantegna RN. Hierarchical structure in financial markets. The European Physical Journal B-Condensed Matter and Complex Systems. 1999;11(1):193–197. doi: 10.1007/s100510050929
  • 13. Quackenbush J. Computational analysis of microarray data. Nature Reviews Genetics. 2001;2(6):418–427. doi: 10.1038/35076576
  • 14. Hira ZM, Gillies DF. A review of feature selection and feature extraction methods applied on microarray data. Advances in Bioinformatics. 2015;2015. doi: 10.1155/2015/198363
  • 15. Friedman J, Hastie T, Tibshirani R. The elements of statistical learning. 10. Springer Series in Statistics New York; 2001.
  • 16. Hubbard DW. How to measure anything: Finding the value of intangibles in business. John Wiley & Sons; 2014.
  • 17. Roques FA, Newbery DM, Nuttall WJ. Fuel mix diversification incentives in liberalized electricity markets: A Mean–Variance Portfolio theory approach. Energy Economics. 2008;30(4):1831–1849. doi: 10.1016/j.eneco.2007.11.008
  • 18. Arnesano M, Carlucci A, Laforgia D. Extension of portfolio theory application to energy planning problem–The Italian case. Energy. 2012;39(1):112–124. doi: 10.1016/j.energy.2011.06.053
  • 19. Dunlop J. Modern Portfolio Theory Meets Wind Farms. The Journal of Private Equity. 2004;7(2):83–95. doi: 10.3905/jpe.2004.391052
  • 20. Markowitz HM, Todd GP. Mean-variance analysis in portfolio choice and capital markets. vol. 66. John Wiley & Sons; 2000.
  • 21. Bongiorno C, Challet D. Nonparametric sign prediction of high-dimensional correlation matrix coefficients; 2019.
  • 22. Yeoh EJ, Ross ME, Shurtleff SA, Williams WK, Patel D, Mahfouz R, et al. Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling. Cancer Cell. 2002;1(2):133–143. doi: 10.1016/S1535-6108(02)00032-6
  • 23. St. Jude Children’s Research Hospital; https://www.stjuderesearch.org/site/data/ALL1/all_rawdata. Accessed on 2020.03.05.
  • 24. Maaten Lvd, Hinton G. Visualizing data using t-SNE. Journal of Machine Learning Research. 2008;9(Nov):2579–2605.

Decision Letter 0

Roberta Sinatra

2 Nov 2020

PONE-D-20-25768

Covariance matrix filtering with bootstrapped hierarchies

PLOS ONE

Dear Dr. Bongiorno,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

The Reviewers find the research sound, but suggest to clarify some technical aspects as well as the rationale behind the applications. I agree with them that a revision along these lines would improve the paper, making it suitable for publication. 

Please submit your revised manuscript by Dec 17 2020 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

We look forward to receiving your revised manuscript.

Kind regards,

Prof. Roberta Sinatra

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. We note that you have stated that you will provide repository information for your data at acceptance. Should your manuscript be accepted for publication, we will hold it until you provide the relevant accession numbers or DOIs necessary to access your data. If you wish to make changes to your Data Availability statement, please describe these changes in your cover letter and we will update your Data Availability statement to reflect the information you provide.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: No

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: This paper is very interesting and possibly suggests real progress in the methods to clean noisy covariance matrices. The paper is rather well written, but I have 3 minor suggestions.

1) The introduction could be made more crisp. In particular, two paragraphs seem to repeat themselves somewhat: one starting with "Here, we introduce a more flexible hierarchical ansatz able to capture more of the structure of the eigenvectors." and then later "Here, we propose a method that improves on hierarchical clustering. We exploit the fact that the less adequate a hierarchical ansatz, the more fragile it is with respect to small data perturbations. "

2) The Hierarchical Clustering Method cannot be understood on the basis of the information given in the SI, so maybe the authors could improve their explanation in order to make the paper more self contained.

3) The authors may be interested to have a look at the following paper, where the problem of comparing in-sample and out-of-sample eigenvectors is discussed:

Bun, J., Bouchaud, J. P., & Potters, M. (2018). Overlaps between eigenvectors of correlated random matrices. Physical Review E, 98(5), 052145.

Reviewer #2: The paper proposes a bootstrap-based method to hierarchically cluster data. The proposed method is termed Bootstrapped Average Hierarchical Clustering (BAHC) and is applied to biological and financial data. A thorough comparison with competing methods is performed (especially for the financial case, where a real-life application is proposed). Clearly the literature on hierarchical clustering is huge and it is very difficult to say whether there are other methods to be used as benchmarks.

Overall the paper is interesting and the results are sound. I recommend revision and resubmission, asking the authors to respond to my comments/criticisms below.

Major remarks:

1) The use of bootstrap in hierarchical clustering was pioneered in Tumminello et al., Spanning Trees and bootstrap reliability estimation in correlation based networks, International Journal of Bifurcation and Chaos 17, 2319-2329 (2007). The type of bootstrap actually looks the same, the main difference being that Tumminello et al. consider the Minimum Spanning Tree associated with a hierarchical clustering, while the current paper considers the hierarchical tree. A discussion of the similarity between the two approaches should be added, and of course the above paper should be cited in the bibliography.

2) The method uses a novel clustering algorithm (HCAL of Ref. [9]) to build the filtered matrices C^{(b)<}, whose average over bootstrap replicas gives the filtered correlation matrix C^{BAHC}. The first obvious question is how critical the use of HCAL is with respect to the many other existing clustering algorithms. Why do the authors choose this method? What happens if other methods (such as average linkage) are used? The second question is whether C^{(b)<} is a correlation matrix and whether this is important. For example, other clustering methods might provide filtered matrices which are not correlation matrices (for example, they are not positive definite), but maybe in the end this is not so important.

3) I am very confused by the biological example. I do not understand what is plotted in Fig. 3 and I am not sure whether the displayed result is good or bad news for the method. Since PLOS ONE is a multidisciplinary journal, it would be nice to have a clearer explanation of this important example.

4) In the portfolio exercise, how do authors estimate past volatilities? Are these simply the standard deviation of returns? The question is relevant since the optimal portfolio strongly depends on the volatilities. Some authors, in order to focus on the estimator of the correlation matrix, use future volatilities. Is this done by the authors?

Minor remarks:

1) Line 176: Authors define q=n/t in the portfolio analysis. However, since the seminal paper by Laloux et al (Phys. Rev. Lett. 1999) the parameter has been defined as Q=T/N (with Laloux's notation). I think it would be less confusing for readers experienced with this type of literature to use Laloux's Q.

2) Line 185 "linewith" -> "line with"

**********

6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2021 Jan 14;16(1):e0245092. doi: 10.1371/journal.pone.0245092.r002

Author response to Decision Letter 0


25 Nov 2020

Both reviews helped us to clarify and expand our submission where needed. We are grateful to them for this.

Regarding the Data Availability concern, unfortunately we are not allowed to share the data. However, since they can be downloaded for free from Yahoo Finance, we have included the list of equities and the code to download them in the SI.

Reviewer #1: This paper is very interesting and possibly suggests real progress in the methods to clean noisy covariance matrices. The paper is rather well written, but I have 3 minor suggestions.

1) The introduction could be made more crisp. In particular, two paragraphs seem to repeat themselves somewhat: one starting with "Here, we introduce a more flexible hierarchical ansatz able to capture more of the structure of the eigenvectors." and then later "Here, we propose a method that improves on hierarchical clustering. We exploit the fact that the less adequate a hierarchical ansatz, the more fragile it is with respect to small data perturbations. "

Thank you for pointing this out. We have modified the introduction accordingly.

2) The Hierarchical Clustering Method cannot be understood on the basis of the information given in the SI, so maybe the authors could improve their explanation in order to make the paper more self contained.

We have modified this part by improving the text and adding an algorithmic description of the method, including that of the average linkage filtering.

3) The authors may be interested to have a look at the following paper, where the problem of comparing in-sample and out-of-sample eigenvectors is discussed:

Bun, J., Bouchaud, J. P., & Potters, M. (2018). Overlaps between eigenvectors of correlated random matrices. Physical Review E, 98(5), 052145.

We agree with the referee that this reference is relevant to our work, and we included a citation in the revised version. Indeed, prior to the first submission, we analyzed the overlap metric defined in that work.

We include here an example for N=100, T=150 of the overlap matrix computed over 1000 randomly sampled consecutive windows. Although the results seem promising, we are not confident in their interpretation, since the marginal distribution of the eigenvalues is extremely different for our method. Furthermore, we are more interested in the N>T regime, whereas that paper investigates the opposite regime. For this reason, we prefer not to include this result in our paper.

Reviewer #2: The paper proposes a bootstrap-based method to hierarchically cluster data. The proposed method is termed Bootstrapped Average Hierarchical Clustering (BAHC) and is applied to biological and financial data. A thorough comparison with competing methods is performed (especially for the financial case, where a real-life application is proposed). Clearly the literature on hierarchical clustering is huge and it is very difficult to say whether there are other methods to be used as benchmarks.

Overall the paper is interesting and the results are sound. I recommend revision and resubmission, asking the authors to respond to my comments/criticisms below.

Major remarks:

1) The use of bootstrap in hierarchical clustering was pioneered in Tumminello et al., Spanning Trees and bootstrap reliability estimation in correlation based networks, International Journal of Bifurcation and Chaos 17, 2319-2329 (2007). The type of bootstrap actually looks the same, the main difference being that Tumminello et al. consider the Minimum Spanning Tree associated with a hierarchical clustering, while the current paper considers the hierarchical tree. A discussion of the similarity between the two approaches should be added, and of course the above paper should be cited in the bibliography.

We thank the referee for raising this point. The type of bootstrap is indeed the same. The main difference is that, instead of using the bootstrap copies to associate a reliability value with the clades of the sample dendrogram (or with the links of the MST, as in the cited paper), we discard the sample dendrogram entirely and use multiple dendrogram realizations to build our correlation estimator.

Nevertheless, we agree that this paper pioneered the concept of bootstrapping dendrograms, and we have included a citation in the text.
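
For illustration, the estimator described in this response can be sketched as follows. This is a minimal sketch and not the authors' code: the function names, the use of 1 − correlation as the clustering distance, and the default number of bootstrap copies are assumptions made for this example. It relies on the fact that, for average linkage, the cophenetic distance between two objects equals the mean distance between the two clusters merged at their first common node, so one minus that distance is the average correlation between those clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import squareform


def average_linkage_filter(corr):
    """Replace each correlation by the mean correlation between the two
    clusters merged at the node where the pair first joins (average linkage)."""
    dist = 1.0 - np.asarray(corr, dtype=float)  # assumed distance: 1 - correlation
    np.fill_diagonal(dist, 0.0)                 # remove rounding noise on the diagonal
    tree = linkage(squareform(dist, checks=False), method="average")
    # Cophenetic distance = merge height = mean inter-cluster distance.
    filtered = 1.0 - squareform(cophenet(tree))
    np.fill_diagonal(filtered, 1.0)
    return filtered


def bootstrap_average_filter(data, n_boot=100, seed=0):
    """Average the filtered matrices of n_boot bootstrap copies of `data`
    (objects in rows, features in columns); no single dendrogram is kept."""
    rng = np.random.default_rng(seed)
    n, t = data.shape
    total = np.zeros((n, n))
    for _ in range(n_boot):
        cols = rng.integers(0, t, size=t)       # resample features with replacement
        total += average_linkage_filter(np.corrcoef(data[:, cols]))
    return total / n_boot
```

Each bootstrap copy yields its own dendrogram, and only the average of the resulting filtered matrices is retained, which is the contrast with reliability estimation on a single sample dendrogram.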

2) The method uses a novel clustering algorithm (HCAL of Ref. [9]) to build the filtered matrices C^{(b)<}, whose average over bootstrap replicas gives the filtered correlation matrix C^{BAHC}. The first obvious question is how critical the use of HCAL is with respect to the many other existing clustering algorithms. Why do the authors choose this method? What happens if other methods (such as average linkage) are used? The second question is whether C^{(b)<} is a correlation matrix and whether this is important. For example, other clustering methods might provide filtered matrices which are not correlation matrices (for example, they are not positive definite), but maybe in the end this is not so important.

We thank the referee for noticing this embarrassing error. C^{(b)<} and C^{<} are based not on Bongiorno et al 2019 (now [11]) but on Tumminello et al 2007. Although the citations in the Methods section are correct, the introduction swapped them.

To the point: the only difference with respect to the method defined in Tumminello et al 2007 is that we consider all the clades of the dendrogram. However, this approach is not a novelty of our paper, since it was already applied by the same authors in Pantaleo et al 2011 (now Ref. [9]).

We agree with the referee that this method can also be applied to matrices that are not positive definite. We believe that this is interesting, and it will be investigated in future work.

3) I am very confused by the biological example. I do not understand what is plotted in Fig. 3 and I am not sure whether the displayed result is good or bad news for the method. Since PLOS ONE is a multidisciplinary journal, it would be nice to have a clearer explanation of this important example.

We thank the referee for this feedback. The point of Fig. 3 is to show that a single dendrogram can fail to capture relevant information; therefore, a multi-dendrogram description, as proposed in this work, should be preferred. In particular, we show that different bootstrap realizations of an HC dendrogram, instead of being scattered around a unique central dendrogram, cluster around two or more centroid dendrograms. This is bad news for strict HC.

We extended the microarray section to clarify this concept.

4) In the portfolio exercise, how do authors estimate past volatilities? Are these simply the standard deviation of returns? The question is relevant since the optimal portfolio strongly depends on the volatilities. Some authors, in order to focus on the estimator of the correlation matrix, use future volatilities. Is this done by the authors?

We thank the referee for this excellent idea. The volatility estimator used in this paper is the historical standard deviation; we have clarified this in the main text. We agree with the referee that using the future realized volatility provides an upper bound on the performance of the method. We have included this analysis in the SI, but we did not observe substantial qualitative differences.
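
To make the two volatility choices concrete, a minimal sketch of a global minimum-variance allocation built from a correlation matrix and a volatility vector is given here. It is an illustration only, under the assumption of unconstrained (long-short) weights that sum to one; the function names and the array layout (assets in rows, time in columns) are hypothetical and are not taken from the paper.

```python
import numpy as np


def gmv_weights(corr, vol):
    """Unconstrained global minimum-variance weights for the covariance
    Sigma = diag(vol) @ corr @ diag(vol): w = Sigma^{-1} 1 / (1' Sigma^{-1} 1)."""
    vol = np.asarray(vol, dtype=float)
    cov = np.asarray(corr, dtype=float) * np.outer(vol, vol)
    raw = np.linalg.solve(cov, np.ones(len(vol)))  # Sigma^{-1} 1 without explicit inverse
    return raw / raw.sum()


def realized_risk(weights, test_returns):
    """Sample standard deviation of the portfolio return over the test window."""
    return float(np.std(np.asarray(weights) @ np.asarray(test_returns), ddof=1))


# Historical volatilities come from the calibration window; "future"
# volatilities come from the test window and isolate the correlation estimate:
#   vol_hist = calib_returns.std(axis=1, ddof=1)
#   vol_future = test_returns.std(axis=1, ddof=1)
#   w = gmv_weights(filtered_corr, vol_hist)   # or vol_future
#   risk = realized_risk(w, test_returns)
```

With uncorrelated assets the weights reduce to inverse-variance weighting, which shows how strongly the allocation depends on the volatility estimates that the referee asks about.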

Minor remarks:

1) Line 176: Authors define q=n/t in the portfolio analysis. However, since the seminal paper by Laloux et al (Phys. Rev. Lett. 1999) the parameter has been defined as Q=T/N (with Laloux's notation). I think it would be less confusing for readers experienced with this type of literature to use Laloux's Q.

We thank the referee for this observation. However, we think that it is not possible to avoid all confusion on this point, since other authors define q=n/t. For example, in the recent review

Bun, J., Bouchaud, J. P., & Potters, M. (2017). Cleaning large correlation matrices: tools from random matrix theory. Physics Reports, 666, 1-109.

On page 4, they use q=n/t. Since we believe that the recent works by Bouchaud and coworkers are more relevant to this topic, we prefer to keep this definition of q.

2) Line 185 "linewith" -> "line with"

Corrected, thanks.

Attachment

Submitted filename: Rebuttal_PlosBAHC.docx

Decision Letter 1

Roberta Sinatra

22 Dec 2020

Covariance matrix filtering with bootstrapped hierarchies

PONE-D-20-25768R1

Dear Dr. Bongiorno,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Roberta Sinatra

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

All Reviewers agree that the revised manuscript addresses all comments from the first round of review and they recommend it for publication. I agree with them.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: no additional comments, all remarks have been addressed. Authors have explained why the data cannot be made fully available

Reviewer #2: The revised version fully answers my previous comments. I recommend it for publication on PlosOne.

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Acceptance letter

Roberta Sinatra

5 Jan 2021

PONE-D-20-25768R1

Covariance matrix filtering with bootstrapped hierarchies

Dear Dr. Bongiorno:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Prof. Roberta Sinatra

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Appendix

    (PDF)

    S1 File. Financial dataset code.

    (ZIP)

    Attachment

    Submitted filename: Rebuttal_PlosBAHC.docx

    Data Availability Statement

    Financial Data cannot be shared publicly although they are publicly available online. We included in the electronic supplementary material the code that we used to download the data from Yahoo Finance.


    Articles from PLoS ONE are provided here courtesy of PLOS
