Open Res Eur. 2023 May 16;3:81. [Version 1] doi: 10.12688/openreseurope.15789.1

scikit-matter: A Suite of Generalisable Machine Learning Methods Born out of Chemistry and Materials Science

Alexander Goscinski 1, Victor Paul Principe 1, Guillaume Fraux 1, Sergei Kliavinek 1, Benjamin Aaron Helfrecht 1,2, Philip Loche 1, Michele Ceriotti 1, Rose Kathleen Cersonsky 1,3,a
PMCID: PMC10792272  PMID: 38234865

Abstract

Easy-to-use libraries such as scikit-learn have accelerated the adoption and application of machine learning (ML) workflows and data-driven methods. While many of the algorithms implemented in these libraries originated in specific scientific fields, they have gained in popularity in part because of their generalisability across multiple domains. Over the past two decades, researchers in the chemical and materials science community have put forward general-purpose machine learning methods. The deployment of these methods into workflows of other domains, however, is often burdensome due to the entanglement with domain-specific functionalities. We present the python library scikit-matter that targets domain-agnostic implementations of methods developed in the computational chemical and materials science community, following the scikit-learn API and coding guidelines to promote usability and interoperability with existing workflows.

Keywords: Python, feature selection, sample selection, PCovR, KPCovR, feature reconstruction, directional convex hull

Plain-language summary

Numerous data-driven and machine-learning studies rely on the library scikit-learn, a package providing implementations and application interfaces for a collection of generally applicable machine learning (ML) methods. With scikit-matter (skmatter) we extend this collection, focusing on methods that are actively used in the computational modelling of materials and chemical systems. We aim to provide users with the ability to seamlessly include these methods in their ML workflows by implementing them in compliance with the scikit-learn API and coding guidelines.

Introduction

While machine learning (ML) algorithms are applied in a wide variety of fields, the relative importance of different aspects of workflows can vary widely between disciplines. For those who apply ML to predict physical or chemical properties, there is an increased emphasis on engineering and understanding numerical features from the underlying physical entities—namely, atoms and their relative positions in the structure or molecule—in a format compatible with ML pipelines 1–4 . Engineering these features is a rich sub-discipline of machine learning unto itself, and thus several best practices have been established for analysing the resulting representations, including a reduction of redundant information in features and samples 5–7 , comparing different featurisations on datasets 8, 9 , and analysing their reduced manifolds 10 . These methods are, however, often inextricably linked to the libraries computing these representations 11–13 and are not available to a wider audience outside of these sub-communities. The objective of the open-source library scikit-matter is to make these ML methods accessible to a wider community by following the scikit-learn API and coding guidelines, and by treating the features as agnostic to their domain. It not only serves as a conduit between featurisation software for comparing alternative descriptors, but allows a ‘plug-and-play’ of these methods into any workflow, irrespective of the representation or field of use.

In this text, we will review several major modules in scikit-matter: 1) Feature Reconstruction Measures (Page 3), used to assess the mutual information contained in two separate representations of the same dataset; 2) Linear and Kernelized Principal Covariates Regression (Page 5), used to construct a new set of features designed to correlate with one or more target properties; 3) Farthest Point Sampling, CUR, and their PCovR adaptations (Page 11), methods used to sub-select features and/or samples for machine learning problems; and 4) the Directional Convex Hull construction, used to identify data points that sit on a bounding manifold of a dataset in a particular direction.

We demonstrate the use of these methods in contexts both inside and outside the chemical domain. We first revisit a dataset originally published in Engel et al. 14 and available publicly via the Materials Cloud Archive 15, 16 . This dataset contains theoretical and known ice structures, i.e., 15,869 “reasonable” crystal structures with hydrogen and oxygen in a 2:1 ratio, their lattice energies (in eV/molecule) and densities (in g/cm³). We have augmented this dataset to include the atomic Mulliken charges, as computed with DFTB+ 17 , in units of elementary charges [e].

We also show the utility of these methods beyond the chemical sciences by employing the World Health Organization statistics on life expectancy 18 , curated from the World Bank's open data repository. This dataset contains 2,020 data points, each representing a different country during a given year, reporting variables pertaining to the population size 19 , gross domestic product (GDP) 20 , health-based 21 and education-based 22 spending, prevalence of HIV/AIDS 23 and tuberculosis 24 , immunisations 25, 26 , and undernourishment of the country’s population 27 .
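
Since the aggregated table ships with scikit-matter (see the Data availability section), it can be loaded directly; a minimal sketch, assuming the loader follows the scikit-learn convention of returning a Bunch-like container with a "data" entry:

from skmatter.datasets import load_who_dataset

who = load_who_dataset()      # aggregated WHO / World Bank table
table = who["data"]           # assumed to hold one row per country-year
print(table.shape)
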

A Quick Guide to the Smooth Overlap of Atomic Positions (SOAP) Representation

For the ice dataset we use the Smooth Overlap of Atomic Positions (SOAP) representation to numerically encode our chemical data 28 . As the SOAP formalism is not well-known outside the atomistic machine learning field, here we give a quick and simplified overview of the concepts and terminology associated with this data representation.

Numerous electronic and chemical properties are determined by the spatial relationship between a “central” atom and its nearest neighbouring atoms 29 . For most properties, the importance of neighbouring atoms decays with their distance from our central atom. It is therefore worthwhile to only represent the neighbourhood around each atom up to a cutoff. In the SOAP representation, each atom in the neighbourhood is represented by a Gaussian density. A finite set of numerical descriptors of all densities in the neighbourhood is then used as features.

However, in order to make efficient use of our data, we need to incorporate the symmetries of our system, as many physical properties do not change under rotations and translations. We therefore use the neighbourhood density centred on each atom i, known as the atomic density ρ_i , to remove the dependency on the origin of our system 6 . The neighbourhood is described by a combination of radial and spherical functions that form a basis for the expansion. One can choose one of many functional forms for the radial basis 4, 28, 30 , and any number of basis functions for this expansion. The number of basis functions determines the number of features; in general, a larger number of features corresponds to a higher “resolution” of the neighbourhood.

The set of expansion coefficients of ρ_i in this basis constitutes the density coefficients for each central atom i. We can take correlations of these coefficients to build descriptors corresponding to two-body correlations ρ_i^1 , three-body correlations ρ_i^2 , or higher. Here, the superscript ν denotes the neighbour correlation order of the expansion, with a two-body correlation being the first-order expansion. We often write features symmetrised over rotations as ρ̄_i^ν , specifically calling ρ̄_i^1 the “Radial Spectrum” and ρ̄_i^2 the “Power Spectrum”.

A full explanation of the SOAP formalism is contained in Bartók et al. 28 and Willatt et al. 6 . SOAP is not the only popular machine-learning representation in chemical sciences, and a full review of many of the different physics-informed representations is available at Musil et al. 4 .

In line with its goal of being disentangled from domain-specific components, scikit-matter itself does not compute these atomic descriptors, and instead takes as input matrices corresponding to descriptors computed by prominent software such as librascal 4 , QUIP 31–33 , and DScribe 34 . For the purposes of testing and examples, scikit-matter does contain minimal datasets, each including a small set of molecules or materials, a suitable featurisation, and a set of properties 35, 36 .
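
For example, a SOAP feature matrix for the ice structures could be produced with DScribe and handed to scikit-matter as a plain array; this is only a sketch, and the hyperparameters (and keyword names, which differ between DScribe versions) are illustrative assumptions rather than the settings used in this work:

from ase.io import read
from dscribe.descriptors import SOAP

frames = read("dataset.xyz", index=":")          # ice structures (see Data availability)

# illustrative settings; keyword names follow recent DScribe releases
soap = SOAP(species=["H", "O"], periodic=True,
            r_cut=4.5, n_max=8, l_max=6, average="inner")

X = soap.create(frames)                          # one feature row per structure
# X can now be passed to any scikit-matter selector, metric, or decomposition
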

Methods contained in scikit-matter

Feature reconstruction measures

In order to compare two separate sets of features, one can employ regression errors to quantify the mutual relationships of different forms of featurisations, as demonstrated in Goscinski et al. 36 . We determine this error, or the feature reconstruction measure (FRM), by reconstructing one set of features from the other with a constrained transformation, where different constraints express different types of relationships between the two sets of features.

Say we have a dataset that is represented in full detail by a representation Θ, and we want to assess the amount of information lost by using an alternate representation F. We can check the detail contained in F by computing FRM(F, Θ), where FRM = 0 corresponds to a perfect reconstruction with no loss, and FRM ≈ 1 denotes a complete loss of information with respect to Θ. However, it is rare that there exists a ground-truth representation Θ, and we are more commonly comparing two likely-imperfect representations F and G. In this case, we compute FRM(F, G) and FRM(G, F). The feature set that is harder to reconstruct from the other, i.e., the one associated with the higher reconstruction measure, is considered higher in information capacity ( e.g. if FRM(F, G) > FRM(G, F), then G is the more information-rich feature set).
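
In practice this two-way comparison can be written in a few lines using the metrics described below; a minimal sketch, where F and G are placeholder feature matrices for the same set of samples:

import numpy as np
from skmatter.metrics import global_reconstruction_error

F = np.random.rand(100, 20)   # placeholder featurisation 1
G = np.random.rand(100, 30)   # placeholder featurisation 2

frm_fg = global_reconstruction_error(F, G)   # error reconstructing G from F
frm_gf = global_reconstruction_error(G, F)   # error reconstructing F from G

# the feature set that is harder to reconstruct from the other is the richer one
richer = "G" if frm_fg > frm_gf else "F"
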

In Figure 1 we show a schematic of the different FRMs contained in scikit-matter. The simplest FRM is the global feature reconstruction error (GFRE), expressed as the linearly decodable information, as given by performing ridge regression between the two sets of features. The global feature reconstruction distortion (GFRD) constrains the transformation to be orthogonal to demonstrate the deformation incurred by transforming between the two feature spaces.

Figure 1. The different forms of feature reconstructions to assess two feature spaces (blue and pink) describing the same dataset.


Here, we are reconstructing the curved manifold (blue) using the planar manifold (pink), as one often approximates a complex manifold with a simpler alternative. The area framed by the dotted line is an example of a local neighbourhood of one sample (the pink dot) that enables the reconstruction of nonlinearities. (Top) The linear transformation used in the global feature reconstruction error (GFRE). (Middle) The orthogonal transformation used in the global feature reconstruction distortion (GFRD). (Bottom) A local linear transformation of a neighbourhood used in the local feature reconstruction error (LFRE). On the right, the reconstructions of the manifold are drawn in pink together with the curved manifold in blue. The measures correspond to the root-mean-square difference between the reconstructed and the curved manifold.

Extending the analysis to non-linear dependencies, the local feature reconstruction error (LFRE) applies the ridge regression for each point locally on the k-nearest neighbours. These unsupervised methods are of particular use in assessing the hyperparameters of ML descriptors 9 and have been employed to compare the efficiency of different basis sets in encoding geometrical information 4, 37 .

Implementation

The FRMs differ in two aspects: the locality of the reconstruction (global or local) and the constraints of the regression. In each FRM, the two feature sets are partitioned into training and testing sets. We standardise the features of F and G individually, then we regress F onto G to compute the errors. In the global measures, we use the entire dataset for the reconstruction, whereas in the local measure we perform a regression for each sample on the set of its k nearest neighbours within F. The number k is given by the user with the parameter n_local_points. The reconstruction error is by default computed using 2-fold cross-validation with ridge regression as the estimator. As most interesting applications use large feature vectors, we implemented a custom RidgeRegression2FoldCV in the skmatter.linear_model module to improve computational efficiency. For the reconstruction distortion we use orthogonal regression as implemented in OrthogonalRegression in the skmatter.linear_model module.

Workflow The FRMs can all be called similarly:

from skmatter.metrics import global_reconstruction_error as GFRE

gfre = GFRE(F, G)

from skmatter.metrics import local_reconstruction_error as LFRE

lfre = LFRE(F, G, n_local_points=10)

It is also possible to specify general feature scalers and estimators in any of the FRM classes, in which case the invocation looks like:

from skmatter.metrics import global_reconstruction_error as GFRE

gfre = GFRE(F, G, scaler=my_scaler,
            estimator=my_estimator)
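
The distortion measure follows the same calling convention; a minimal sketch (F and G as above), assuming the helper global_reconstruction_distortion mirrors the error metric shown here:

from skmatter.metrics import global_reconstruction_distortion as GFRD

gfrd = GFRD(F, G)
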

Use Case: Determining the number of radial basis functions necessary for representing neighbourhood shells

In common frameworks for machine learning potentials, each atomic neighbourhood is represented by expanding the many-body correlation information over a set of basis functions 4 . The resolution of this expansion varies greatly based upon the type of dataset and choice of basis. In this use case, we analyse the number of radial basis functions required to resolve each neighbour shell from a central atom. Here, “shell” refers to a spherical shell that corresponds to a peak in the radial distribution function, as sketched in Figure 2(a). For a given choice of basis functions, we say that the representation is converged when adding new bases to the series yields no new information. In other words, the number of basis functions n has converged when the GFRE between the representations built with n and n + 1 basis functions, GFRE(n, n + 1), saturates. Here we show the convergence behaviour of a PCA-optimised 1 basis set 37 expanded on the “Radial Spectrum”, reporting the number of basis functions needed to resolve each neighbour shell up to the numerical accuracy of the linear system solver within the ice dataset. The numerical accuracy of the solver is computed as the GFRE for the identity relationship using 100 basis functions.
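
The convergence criterion described above amounts to a short loop; a schematic sketch, where features_with_n_bases(n) is a hypothetical helper that returns the “Radial Spectrum” computed with n radial basis functions:

from skmatter.metrics import global_reconstruction_error as GFRE

# numerical floor: GFRE of a feature set against itself (identity relation)
X_ref = features_with_n_bases(100)          # hypothetical featurisation helper
solver_accuracy = GFRE(X_ref, X_ref)

n = 1
while True:
    gfre_n = GFRE(features_with_n_bases(n), features_with_n_bases(n + 1))
    if gfre_n <= solver_accuracy:           # the (n + 1)-th basis adds no new information
        break
    n += 1
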

Figure 2. The relationship between number of radial basis functions and the resolution of each neighbourhood shell in water.


a) The different atomic shells present in the ice dataset. b) The convergence of the GFRE for an incremental change in the number of basis functions normalised by the numerical accuracy of the linear system solver. The numerical accuracy of the solver is determined by computing the GFRE for the identity relation for 100 basis functions. c) The number of basis functions required to resolve up to the n th neighbour shell.

For this dataset, we determine two neighbour shells for the species pairs H-H and O-H (demonstrated in Figure 2(a)). With each additional shell, we need a greater number of basis functions, as shown in Figure 2(c). We see a near-linear relationship between the number of shells considered and the number of basis functions needed to obtain convergence. In fact, using a toy dataset with equidistant shells and a uniform distribution of atoms in each shell, we recover a perfect linear relationship. Thus, the number of basis functions needed to obtain similar resolutions scales linearly with the number of shells. The difference between these results and those of the toy dataset can be explained by the irregularly-spaced shells and the non-uniform number of atoms across shells in the ice dataset.

Linear and non-linear principal covariates regression

Often, one wants to construct new ML features from their current representation in order to compress data or visualise trends in the dataset. In the archetypal method for this dimensionality reduction, principal components analysis (PCA), features are transformed into the latent space that best preserves the variance of the original data. Principal Covariates Regression (PCovR), as introduced by de Jong and Kiers 38 , is a modification of PCA that incorporates target information, such that the resulting embedding can be tuned using a mixing parameter α to improve performance in regression tasks ( α = 0 corresponding to linear regression and α = 1 corresponding to PCA). Helfrecht et al. 10 introduced the non-linear version, Kernel Principal Covariates Regression (KPCovR), where the mixing parameter α instead interpolates between kernel ridge regression ( α = 0) and kernel principal components analysis (KPCA, α = 1) 39 .

The α parameter determines how much emphasis is given to the regression performance versus the feature reconstruction. As shown for a toy dataset in Figure 3, a KPCovR with α = 1.0 will replicate the results of a Kernel PCA (weighted entirely on the reconstruction task), whereas at α = 0.0 the latent space is determined entirely by the regression weights. Typically, a value of α ≈ 0.5 yields the most qualitatively insightful results, provided that the features and targets are properly normalised.

Figure 3. The evolution of latent-space projections and regressions as the mixing parameter α goes from 1 (Kernel PCA) to 0 (Kernel Ridge Regression) in Kernel PCovR.


This procedure transforms the latent space projection to minimise combined KPCA and KRR loss. Typically, a value of α = 0.5 yields a balanced projection and can be used to construct insightful feature-property maps.

Implementation

In a PCA decomposition, we obtain a latent-space projection by taking the singular value decomposition of the feature matrix X,

X = U_K Σ U_C^T     (1)

where U_K and U_C are the eigenvectors of the Gram matrix XX^T and of the covariance matrix X^T X, respectively, and Σ contains the singular values. In PCovR, we instead use the eigendecomposition of a modified Gram or covariance matrix, for example mixing XX^T with the outer product of our predicted targets, ŶŶ^T. As covered by de Jong and Kiers 38 , it is useful to build this decomposition on the approximated target values. In scikit-matter, this is done by providing the predicted targets Ŷ or a scikit-learn-style estimator with which to regress X onto Y.
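
In the sample-space formulation, and assuming X and Ŷ have been standardised, the modified Gram matrix can be written as (a sketch of the form used in de Jong and Kiers 38 and Helfrecht et al. 10 , not necessarily the exact normalisation used in the code):

K̃ = α XX^T + (1 − α) ŶŶ^T
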

Similarly, KPCovR constructs a modified kernel matrix, replacing the Gram matrix XX^T with a user-selected kernel (with kernel-building capabilities and an API consistent with scikit-learn), and requires a kernel regressor to obtain the approximated Ŷ.

Workflow For example, a typical invocation of KPCovR is:

from sklearn.kernel_ridge import KernelRidge
from skmatter.decomposition import KernelPCovR

kernel_params = dict(kernel="rbf",
                        gamma=1)

regressor = KernelRidge(**kernel_params)

kpcovr = KernelPCovR(
                        mixing=0.5,
                        n_components=2,
                        regressor=regressor,
                        **kernel_params
                       )
kpcovr.fit(X, Y)

T = kpcovr.transform(X)
Yp = kpcovr.predict(X)

where X and Y are standardised matrices containing our features and target properties, respectively. For linear PCovR, regressor can be set to be any linear regression object within scikit-learn.
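
A linear PCovR is constructed analogously; a minimal sketch, assuming the linear variant is exposed as skmatter.decomposition.PCovR alongside KernelPCovR, and where the choice of Ridge regressor and its regularisation strength are illustrative:

from sklearn.linear_model import Ridge
from skmatter.decomposition import PCovR

pcovr = PCovR(mixing=0.5, n_components=2,
              regressor=Ridge(alpha=1e-4))
pcovr.fit(X, Y)

T = pcovr.transform(X)     # latent-space projection
Yp = pcovr.predict(X)      # regressed targets
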

Use Case: Mapping the charge of oxygen in ice

We build a linear PCovR map using as X the 3-body SOAP representations reported in Engel et al. 14 and as targets Y, the Mulliken charges 2 (in units of e) for each of the water atoms in a subset of the ice structures.

By changing the mixing parameter α, we directly specify the attention that PCovR (or KPCovR) gives to the task of learning the target properties. In Figure 4, we show the performance of a PCovR mapping at both reconstructing the original SOAP vectors (ℓ_X = ‖X − X̂‖²) and regressing the charges (ℓ_Y = ‖Y − Ŷ‖²). Here we use a fixed regularisation constant for comparability between different numbers of components and mixing parameters.

Figure 4. Reconstruction and Regression Errors for the PCovR Mappings.


In each case, we report the unitless ℓ2 loss for multiple values of α and number of components, for the reconstruction ℓ_X = ‖X − X̂‖² and the regression of the Mulliken charges ℓ_Y = ‖Y − Ŷ‖², both summed (top) and separately (bottom). (left) Plotted against the mixing parameter α. (right) Plotted against the number of components.

While all latent-space projections approach the same loss with increasing dimensionality, when using fewer components the effect on the regression loss ℓ_Y is much larger than on the reconstruction loss ℓ_X, as shown by the near-flat curves of ℓ_X across the mixing parameter. Conversely, the change in ℓ_Y across α can be quite large (as also demonstrated by the poor regression saturation at α = 0.999 in the top-right panel). Note also, particularly in the top-left panel, that the errors increase steeply in the extreme cases ( α = 0, α = 1). Thus, in the majority of cases, an intermediate value will not only provide a better combined error, but also better performance in regression tasks while having minimal impact on the reconstruction task.

We also often choose a mixing parameter of α = 0.5 for visualisation tasks, to weight the reconstruction and regression tasks equally. The resulting map, shown on the right side of Figure 5, delineates our oxygen environments based upon chemical similarity and our target charges; from here we can clearly see the separation of oxygens in hydroxide (⋆, □, ᐁ, ⋄), hydronium (⊲, ⊳), and different arrangements of water, as shown in the insets below the maps. This is not true in the corresponding PCA (left), which is unable to distinguish these very different environments.

Figure 5. A Comparison of PCovR and PCA maps of Oxygens in Ice Structures.


The two upper maps show the first two components (left) or covariates (right) of the oxygen environments, coloured by the Mulliken charge. We have highlighted several extreme examples with special markers, corresponding to the markers also shown in the upper left corner of each inset.

Use Case: Reducing the dimensions of the WHO dataset to predict life expectancy

The variables in the WHO dataset suffice to predict the life expectancy with an R² score of 0.87 on a held-out test set with cross-validated ridge regression. When we use traditional PCA to reduce this set of variables to two components, our accuracy in prediction drops to R² = 0.81 ( Figure 6(a)). This is marginally improved by turning to PCovR, where regressing on the first two covariates gives R² = 0.83 ( Figure 6(b)).

Figure 6. Linear and Non-linear Principal Components Analysis and Principal Covariates Regression Applied to the WHO Dataset.


Here we show the resultant map and parity plot for ( a) PCA, ( b) PCovR at α = 0.5, ( c) KPCA, and ( d) KPCovR at α = 0.5 based on 13 statistical variables in order to predict life expectancy for a given country. In each case, we used the same standardised variable matrix to determine a two-dimensional latent space mapping (shown in the map) that was then fed into an appropriate regularised regression model, yielding the predictions in the parity plot. Regressions were all computed using the same 90/10 train/test split, with regression results reported for only the testing split.

The most dramatic results come from using non-linear methods. Employing kernel ridge regression with an optimised RBF kernel on the original features raises our maximum accuracy to R² = 0.96. Using the same kernel parameters to choose two features via kernel PCA reduces our accuracy to R² = 0.57 ( Figure 6(c)); however, by doing the same with kernel PCovR we sacrifice little in accuracy (R² = 0.96, Figure 6(d)). In other words, we are able to obtain nearly the same regression performance using two covariates determined by KPCovR as we did with the full kernel. This example, from start to finish, is available in the scikit-matter documentation via https://scikit-matter.readthedocs.io/en/v0.1.4/read-only-examples/FeatureSelection-WHODataset.html.

These kernel principal covariates can then be used to perform correlative analysis with the original demographic or statistical features, as was demonstrated by Cersonsky et al. 40 for medical diagnostics in the context of stillbirth outcomes.

Feature and sample selection

In this section, we will detail two very different classes of selectors: 1) Farthest Point Sampling, CUR, and their PCovR adaptations, which are flexible score-based greedy selectors 3 that can be employed for either feature or sample selection; and 2) the directional convex hull, a non-greedy sample selection algorithm which is primarily used in chemical thermodynamic analyses.

The feature and sample selection methods contained in scikit-matter are both programmatically alike and contextually different. Therefore, we established a consistent API to allow users to apply these similar concepts in their different contexts, and we will start by reviewing this common implementation.

Implementation

The classes are exposed to the user via skmatter.feature_selection and skmatter.sample_selection. Each selector is passed an input X (and Y, when appropriate) to a fit function. This fit function performs the sub-selection until the desired number of selections has been made (initialisation parameter n_to_select), the algorithm has terminated, or a score threshold has been reached (initialisation parameter score_threshold). Once the fitting is complete, the selector contains the indices of the selections (relative to the input matrices) in a member variable selected_idx_. The scores that evaluate all points in the context of these selections are available via the score function.
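
As a concrete illustration of this shared interface (a minimal sketch; X is any feature matrix, and the calls follow the description above):

from skmatter.feature_selection import FPS

selector = FPS(n_to_select=10)
selector.fit(X)

idx = selector.selected_idx_     # indices of the selected features
X_reduced = X[:, idx]            # reduced feature matrix
scores = selector.score(X)       # scores of all features given the current selection
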

Farthest point sampling, CUR, and their PCovR-adaptations

Farthest point sampling (FPS) and CUR decomposition are available in unsupervised 41, 42 or hybrid (PCovFPS and PCovCUR) 7 variants for both feature and sample selection. In the hybrid variant, similar to the dimensionality reduction techniques discussed in Section "Linear and non-linear principal covariates regression" on Page 5, a parameter α is used to weight the supervised and unsupervised contributions to the covariance (for feature selection) or the Gram matrix (for sample selection) in the range from 0 to 1, with α = 1.0 directly corresponding to the unsupervised version of each algorithm published in Imbalzano et al. 42 .

FPS In Farthest Point Sampling (FPS), we start with a random point, and at each iteration select the point that maximises the distance to the previously selected set. This min-max distance is traditionally called the Hausdorff distance, which serves as the FPS scoring metric. The unsupervised variant uses the Euclidean distance as the default measure for the Hausdorff distance. The hybrid variant incorporates regression weights from an estimator into the distance calculation based upon the relationship between Euclidean distance and the corresponding covariance or Gram matrices of the input matrix.
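
Schematically, the unsupervised algorithm can be written in a few lines of NumPy; this is an illustrative sketch rather than the scikit-matter implementation:

import numpy as np
from scipy.spatial.distance import cdist

def farthest_point_sampling(X, n_to_select, first=0):
    """Greedy FPS over the rows of X using Euclidean distances."""
    selected = [first]
    # Hausdorff distance of every sample to the current selection
    hausdorff = cdist(X, X[[first]]).ravel()
    for _ in range(n_to_select - 1):
        new = int(np.argmax(hausdorff))                       # farthest remaining point
        selected.append(new)
        hausdorff = np.minimum(hausdorff, cdist(X, X[[new]]).ravel())
    return np.array(selected)
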

Computing the Hausdorff distance, in any variable space, constructs an implicit Voronoi tessellation 43 that can be exploited to reduce the number of distance calculations throughout the selection procedure. At each iteration, only the selected points at the boundaries of the Voronoi polyhedra need to be considered for new distance calculations. We have implemented a Voronoi-accelerated FPS variant, skmatter.sample_selection.VoronoiFPS, which results in large speedups for sample selection in dense datasets, where the effect of this algorithm is largest. Benchmark results comparing classical FPS and the Voronoi version can be found in the supporting information of Cersonsky et al. 7 .
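
Switching to the Voronoi-accelerated variant only changes the class used; a minimal sketch:

from skmatter.sample_selection import VoronoiFPS

selector = VoronoiFPS(n_to_select=100)
selector.fit(X)
train_idx = selector.selected_idx_
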

Workflow A typical workflow using FPS looks like this:

from skmatter.feature_selection import FPS
selector = FPS(
                 n_to_select=4,
                 progress_bar=True,
                 score_threshold=1E-12,
                 initialize=[0,1,2]
                )
selector.fit(X)

where the parameter n_to_select refers to the number of selections to be made, progress_bar signifies whether to show a tqdm-style progress bar 44 , score_threshold is the score at which to terminate the selection (potentially before n_to_select is reached), and initialize signifies the index or indices with which to start the selection procedure.

CUR selects features or samples based upon an “importance score” π, defined as the magnitude of a feature or sample’s projection along the first k principal components (unsupervised) or covariates (hybrid PCovCUR) of the overall dataset 7, 41, 42 . π serves as the scoring metric for CUR. In the iterative implementation, the remaining features/samples are orthogonalised to the previous selections at each iteration, as this ultimately minimises the mutual information in each subsequent selection 42 .
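
For the unsupervised case, π reduces to the leverage score of Mahoney and Drineas 41 ; a minimal NumPy sketch for feature selection (normalisation conventions vary between implementations):

import numpy as np

def feature_importance(X, k=1):
    """pi_j: squared projection of feature j onto the top-k right singular vectors of X."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return np.sum(Vt[:k] ** 2, axis=0)

pi = feature_importance(X, k=1)
best_feature = int(np.argmax(pi))
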

Workflow A typical workflow using CUR looks like this:

from skmatter.feature_selection import CUR
selector = CUR(
                 n_to_select=4,
                 progress_bar=True,
                 score_threshold=1E-12,
                 k=1,
                 recompute_every=1
                )
selector.fit(X)

where, similar to FPS, the parameter n_to_select refers to the number of selections to be made, progress_bar signifies whether to show a tqdm-style progress bar, and score_threshold is the score at which to terminate the selection (potentially before n_to_select is reached). k is the number of components or covariates to consider, and the parameter recompute_every indicates after how many iterations to orthogonalise the remaining features or samples. Setting the latter to a value larger than one can speed up the computation (at the cost of including multiple redundant features or samples), as orthogonalising and recomputing the π score are the most costly steps in the algorithm.

Use Case: Nearsightedness of lattice energies in ice structures

There are two interesting regimes for feature or sample selection – the practical reduction of data spaces for efficiently computing supervised models and the interpretability of a small handful of selected features or samples. As the former is thoroughly covered in Cersonsky et al. 7 , we will focus on the latter here.

Here we employ the real-space “Radial Spectrum” expanded on the O-H distances of the central oxygen environments for the stable ice crystals. The distribution of the ρ̄_OH^1(r) features is shown in the top panel of Figure 7.

Figure 7. Selecting Features for Learning Lattice Energies.


(Top) The real-space “Radial Spectrum” ρ̄_OH^1(r) for the hydrogen neighbours of central oxygen environments in the ice dataset. Each line corresponds to one oxygen environment. (Bottom) Selected features for both methods, noting the distance on the x-axis and the selection method by colour (blue for PCovCUR and red for CUR). The number next to each point corresponds to the R² for the prediction of the lattice energies with a linear model upon selection of that feature.

We compare CUR with its hybrid variant PCovCUR to select 5 features. While the two selection methods choose a similar initial feature (corresponding to the hydrogen neighbours at ≈ 3.0 Å), the subsequently selected features diverge between the two methods.

Within five selections, PCovCUR identifies features that are capable of regressing the energies with fair accuracy, with an R² value of 0.96. PCovCUR selects features corresponding to the density values between the first and second neighbour shell (2nd and 4th selected feature) and between the second and third neighbour shell (1st, 3rd, and 5th selected feature). This makes sense from a physical perspective – the greatest proportion of the lattice energy in a molecular solid such as ice comes from the intermolecular interactions close to the central atom. Especially when working with highly redundant or inefficient features, hybrid feature selection methods thus both improve performance and reflect physical intuition.

Conversely, the unsupervised selection method CUR preferentially chooses the far-sighted features. This is explained by the fact that unsupervised methods aim to maximise the span of the resulting feature space, and in this representation the far-sighted features show the greatest variance. 4 This also explains the poorer performance of regressions built on these features, which reach a maximum R² value of 0.8.

Use Case: Choosing Features for Kernel Regression in the WHO Dataset

As shown in the previous use case, it is most effective to regress the life expectancies using a non-linear model. Here we show how feature selection mechanisms incorporating linear principal covariates yield results comparable to a recursive feature selection method based upon kernel regression.

We use the previously detailed selection schemes and compare them to Recursive Feature Selection (RFS) based upon kernel ridge regression on the training set. In the latter method, we iteratively select the feature that best improves regression performance; this can serve as a putative “best selection” in datasets with so few features. For the FPS methods, we again choose the equivalent CUR-selected first feature to initialise the algorithm.
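
As a point of reference, the RFS baseline can be sketched as a greedy forward-selection loop built on scikit-learn components; this is an illustrative implementation, not a scikit-matter class, and the kernel parameters are placeholders:

import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import cross_val_score

def recursive_feature_selection(X, y, n_to_select, kernel_params=None):
    kernel_params = kernel_params or {"kernel": "rbf", "gamma": 1.0}
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(n_to_select):
        # add the feature whose inclusion maximises the cross-validated R^2
        scores = [
            np.mean(cross_val_score(KernelRidge(**kernel_params),
                                    X[:, selected + [j]], y, cv=5))
            for j in remaining
        ]
        selected.append(remaining.pop(int(np.argmax(scores))))
    return selected
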

In each feature selection scheme, we select using the described methods and then compute the kernel ridge regression error for a held-out test set. As shown in Figure 8, the parameters with the greatest impact on the regression are a country’s GDP and the prevalence of HIV/AIDS. This is unsurprising, as it is well recognised in the literature that the wealth of a country and the prevalence of HIV/AIDS directly impact the life expectancy 45–47 . This is reflected in the jump in accuracy for PCov-FPS and PCov-CUR at the second selection, as well as for FPS and CUR at the fifth and seventh selection, respectively. Otherwise, the most successful algorithms (RFS, PCov-CUR, and PCov-FPS) also incorporate different immunisations and health expenditures.

Figure 8. Feature Selection Methods Applied to the WHO Dataset.


(top) Kernel regression performance for the different selection methods with increasing number of selected features. (bottom) Order of selected features for each selection method. We report the features in order of their selection by Recursive Feature Selection, which shows that the PCov-inspired methods most closely align with this “ground-truth”.

Directional convex hull

In the context of chemical sciences, a convex hull represents the set of structures that are thermodynamically stable from the perspective of a mixing model. In other words, if phases A, B, and C are of compatible stoichiometries, and A and B lie on the convex hull and phase C lies above, then C is unstable towards decomposition into a separated mixture of A and B ( Figure 9).

Figure 9. Schematic of a Convex Hull Construction.


We first represent our data points across some variable (X, traditionally macroscopic density in thermodynamic convex hulls) and our energies. As shown by the black circles, our energetic convex hull lies at the bottom boundary of this plot. Programmatically, we use scipy.spatial.ConvexHull to determine the omnidirectional convex hull (red points), then select those simplices whose normal vector has a negative component along the energy axis. Any phase above the convex hull (e.g. C) will decompose into the two phases directly below it on the convex hull (e.g. A and B).

Usually, convex hulls are determined by a plot of the phases’ densities and energies. The convex hull is the set of points sitting on the lower boundary of this plot. Anelli et al. 48 showed that the comparative dimension can be a generalised latent variable that spans the diversity of the given dataset, in their case a sketch-map component 49 .

Outside of the chemical sciences, a convex hull is typically considered the set of points that define a bounding manifold, i.e., the vertices of the convex shape in which all other points are contained. This is implemented in Python in the package scipy 50 , which employs Qhull 51 . We extend this functionality by allowing users to select the vertices subject to a directionality constraint (e.g., those points that define the surface below all other points with respect to a given observable). We have implemented this without explicit reference to chemical or energetic considerations, such that the DirectionalConvexHull can be used for similar tasks, including those shown in the mathematics 52 and economics 53 fields.

Implementation In scikit-matter, DirectionalConvexHull is exposed through the skmatter.sample_selection submodule. In addition to the traditional input matrix and target vector, the fit function takes as input which columns of X to consider in the convex hull construction, where the default behaviour is to use the first column of X.

During the fit function, we employ scipy.spatial.ConvexHull to determine the omnidirectional convex hull. We then determine which of these points lie on simplices whose normal points downwards in the supplied property dimension. Like the selectors covered on Page 11, the indices of these points are saved into the class variable selected_idx_. Once the hull has been determined, the score_samples function determines the distance of given samples to the hull in the property dimension. One may also use the score_feature_matrix function to obtain the distance of samples to the hull in the dimensions not used to construct the directional convex hull. This distance in the higher-dimensional space is a measure of the putative stability of all non-hull phases, where phases near the hull are considered more stable than those further away 48 .
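
A minimal sketch of this workflow, relying on the default behaviour of constructing the hull in the first column of X (the call signatures follow the description above):

from skmatter.sample_selection import DirectionalConvexHull

dch = DirectionalConvexHull()
dch.fit(X, y)                           # y: the property (e.g. energy per molecule)

hull_idx = dch.selected_idx_            # samples on the directional convex hull
distances = dch.score_samples(X, y)     # distance to the hull along the property axis
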

Case Study: Determining the convex hull of ice structures

Engel et al. 14 determined a “generalised” convex hull that explicitly accounts for energetic and conformational uncertainty by using numerous trials of the algorithm detailed here. In each trial, they would perturb the atomic coordinates and energies of a selected number of structures within the uncertainty estimate of their calculations and compute the convex hull. They determined the convex hull as the points that are most often selected across many trials. Here, we will demonstrate one such trial by selecting the hull points using the REMatch kernel 54 as done in the original publication.

We build a two-component Kernel PCA on the normalised kernel matrix using the scikit-learn implementation. We construct our directional convex hull on these components, with the resulting hull shown in Figure 10.

Figure 10. The directional convex hull of ice structures.


Here we show a directional convex hull, constructed using the first two KPCA dimensions as features and the per-molecule energies of ice structures as the target property. The small points correspond to all the structures in the dataset, whereas the larger points correspond to those that lie on the convex hull (shown by the grey surface), as determined using the fit function of scikit-matter.

In Figure 10, we see that the points that lie on the vertices of the directional convex hull are lower in energy relative to chemically similar points in their surroundings. As all structures that sit above the hull energetically will decompose into a mixture of the more stable phases, those on the vertices are more likely to be found in nature or be candidates for experimental synthesis. Although the KPCA features may appear abstract, they often correlate with characteristics of the structures, which are qualitatively analysable through chemiscope 55 or similar visualisation suites.

Conclusions

In this work we have demonstrated the use of scikit-matter, a scikit-learn extension that focuses on functions of particular relevance in materials science and chemistry. As the examples on the WHO dataset show, the features considered need not be chemically-based correlation functions, but might be economic predictors, diagnostic test results, or the engineered features of different autoencoder infrastructures. This illustrates how the compatibility of scikit-matter with scikit-learn allows a frictionless embedding of the demonstrated implementations into data-driven workflows in any domain. This not only introduces beneficial algorithms to users who may be unfamiliar with them, but also spares knowledgeable users the time-consuming implementation of these methods, thereby accelerating research endeavours across a wide range of fields. The module also provides a basis for further methods developed in the materials science and chemistry communities to be integrated into a stable library in the future.

Installing scikit-matter

scikit-matter is available via the Python Package Index (PyPI) 56 and through GitHub. Independent developers are encouraged to contribute and can find more information on how to do so in the package documentation at scikit-matter.readthedocs.io.

Acknowledgements

We acknowledge contributions from the Laboratory of Computational Science and Modeling, particularly Sergey Pozdnyakov, Giulio Imbalzano, and Max Veit.

Funding Statement

V.P.P. and M.C. received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme, Grant No. 101001890-FIAMMA. M.C., R.K.C., and B.A.H. received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme, Grant No. 677013-HBMAP. A.G. and M.C. acknowledge support from the Swiss National Science Foundation (Project No. 200021-182057). G.F. acknowledges support from the Swiss Platform for Advanced Scientific Computing (PASC). R.K.C. acknowledges support from the University of Wisconsin - Madison and the Wisconsin Alumni Research Foundation (WARF).

The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

[version 1; peer review: 1 approved, 1 approved with reservations]

Footnotes

1. We optimise our basis functions starting from 100 basis functions of the DVR basis 4 .

2. Computed using DFTB+ 17 with the Third-Order Parameters for Organic and Biological Systems (3OB).

3. Greedy algorithms choose points iteratively based upon a scoring metric, whereas non-greedy algorithms simultaneously choose points that meet certain criteria.

4. To compensate for this effect, a radial scaling is often applied to SOAP features, usually leading to improved regression of atomistic properties.

Data availability

Materials Cloud: Mapping uncharted territory in ice from zeolite networks to ice structures. DOI: 10.24435/materialscloud:2018.0010/v1. Engel et al. 16

The project employs the following underlying data:

  • dataset.xyz: Structures in XYZ format for 15,869 geometry-optimised ice structures. Optimisation was carried out using PBE-DFT in the CASTEP code, with a plane-wave energy cut-off of 490 eV, a maximum k-point spacing of 2π × 0.07 Å⁻¹, and on-the-fly generated ultrasoft pseudopotentials.

  • properties.dat: Properties of all ice structures in dataset.xyz, including the CSD identifier, number of atoms per unit cell, density, configurational energy, and energy with respect to the energy-density convex hull.

We also employ publicly available datasets provided by The World Bank:

  • Life expectancy at birth, total (years), World Bank 18 .

  • Population, total, World Bank 19 .

  • GDP per capita (current USD), World Bank 20 .

  • Current health expenditure (percentage of GDP), World Bank 21 .

  • Government expenditure on education, total (percentage of GDP) 22 .

  • Prevalence of HIV, total (percentage of population ages 15–49) World Bank 23 .

  • Incidence of tuberculosis (per 100,000 people) World Bank 24 .

  • Immunization, measles (percentage of children ages 12–23 months), World Bank 25 .

  • Immunization, DPT (percentage of children ages 12–23 months) World Bank 26 .

  • Prevalence of undernourishment (percentage of population) World Bank 27 .

The aggregated WHO dataset can be retrieved through skmatter.datasets.load_who_dataset.

Data are available under the terms of the Creative Commons Attribution 4.0 International license (CC-BY 4.0).

References

  • 1. Shapeev AV: Moment tensor potentials: A class of systematically improvable interatomic potentials. Multiscale Model Simul. 2016;14(3):1153–1173. 10.1137/15M1054183 [DOI] [Google Scholar]
  • 2. Drautz R: Atomic cluster expansion for accurate and transferable interatomic potentials. Phys Rev B. 2019;99(1):014104. 10.1103/PhysRevB.99.014104 [DOI] [Google Scholar]
  • 3. Deringer VL, Bartók AP, Bernstein N, et al. : Gaussian process regression for materials and molecules. Chem Rev. 2021;121(16):10073–10141. 10.1021/acs.chemrev.1c00022 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4. Musil F, Grisafi A, Bartók AP, et al. : Physics-Inspired Structural Representations for Molecules and Materials. Chem Rev. 2021;121(16):9759–9815. 10.1021/acs.chemrev.1c00021 [DOI] [PubMed] [Google Scholar]
  • 5. Bartók AP, De S, Poelking C, et al. : Machine learning unifies the modeling of materials and molecules. Sci Adv. 2017;3(12):e1701816. 10.1126/sciadv.1701816 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6. Willatt MJ, Musil F, Ceriotti M: Feature optimization for atomistic machine learning yields a data-driven construction of the periodic table of the elements. Phys Chem Chem Phys. 2018;20(47):29661–29668. 10.1039/c8cp05921g [DOI] [PubMed] [Google Scholar]
  • 7. Cersonsky RK, Helfrecht BA, Engel EA, et al. : Improving sample and feature selection with principal covariates regression. Mach Learn: Sci Technol. 2021;2(3):035038. 10.1088/2632-2153/abfe7c [DOI] [Google Scholar]
  • 8. Parsaeifard B, De DS, Christensen AS, et al. : An assessment of the structural resolution of various fingerprints commonly used in machine learning. Mach Learn: Sci Technol. 2021;2(1):015018. 10.1088/2632-2153/abb212 [DOI] [Google Scholar]
  • 9. Goscinski A, Fraux G, Imbalzano G, et al. : The role of feature space in atomistic learning. Mach Learn: Sci Technol. 2021;2(2):025028. 10.1088/2632-2153/abdaf7 [DOI] [Google Scholar]
  • 10. Helfrecht BA, Cersonsky RK, Fraux G, et al. : Structure-property maps with kernel principal covariates regression. Mach Learn: Sci Technol. 2020;1(4):045021. 10.1088/2632-2153/aba9ef [DOI] [Google Scholar]
  • 11. Behler J: RuNNer. Reference Source [Google Scholar]
  • 12. Bartók-Pártay A, Bartók-Pártay L, Bianchini F, et al. : libAtoms+QUIP. 2020. Reference Source [Google Scholar]
  • 13. Novikov IS, Gubaev K, Podryabinkin EV, et al. : The MLIP package: moment tensor potentials with MPI and active learning. Mach Learn: Sci Technol. 2021;2(2):025002. 10.1088/2632-2153/abc9fe [DOI] [Google Scholar]
  • 14. Engel EA, Anelli A, Ceriotti M, et al. : Mapping uncharted territory in ice from zeolite networks to ice structures. Nat Commun. 2018;9(1):2173. 10.1038/s41467-018-04618-6 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15. Talirz L, Kumbhar S, Passaro E, et al. : Materials cloud, a platform for open computational science. Sci Data. 2020;7(1):299. 10.1038/s41597-020-00637-5 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16. Engel EA, Anelli A, Ceriotti M, et al. : Mapping uncharted territory in ice from zeolite networks to ice structures. 2018. 10.1038/s41467-018-04618-6 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17. Hourahine B, Aradi B, Blum V, et al. : DFTB+, a software package for efficient approximate density functional theory based atomistic simulations. J Chem Phys. 2020;152(12):124101. 10.1063/1.5143190 [DOI] [PubMed] [Google Scholar]
  • 18. World Bank: Life expectancy at birth, total (years).Technical report, The World Bank, Washington, DC,2023. Reference Source [Google Scholar]
  • 19. World Bank: Population, total.Technical report, The World Bank, Washington, DC,2023. Reference Source [Google Scholar]
  • 20. World Bank: Gdp per capita (current us$).Technical report, The World Bank, Washington, DC,2023. Reference Source [Google Scholar]
  • 21. World Bank: Current health expenditure (% of gdp).Technical report, The World Bank, Washington, DC,2023. Reference Source [Google Scholar]
  • 22. World Bank: Government expenditure on education, total (% of gdp).Technical report, The World Bank, Washington, DC,2023. Reference Source [Google Scholar]
  • 23. World Bank: Prevalence of hiv, total (% of population 15-49).Technical report, The World Bank, Washington, DC,2023. Reference Source [Google Scholar]
  • 24. World Bank: Incidence of tuberculosis (per 100,000 people).Technical report, The World Bank, Washington, DC,2023. Reference Source [Google Scholar]
  • 25. World Bank: Immunization, measles (% of children ages 12-23 months).Technical report, The World Bank, Washington, DC,2023. Reference Source [Google Scholar]
  • 26. World Bank: Immunization, dpt (% of children ages 12-23 months).Technical report, The World Bank, Washington, DC,2023. Reference Source [Google Scholar]
  • 27. World Bank: Prevalence of undernourishment (% of population). Technical report, The World Bank, Washington, DC,2023. Reference Source [Google Scholar]
  • 28. Bartók AP, Kondor R, Csányi G: On representing chemical environments. Phys Rev B. 2013;87(18):184115. 10.1103/PhysRevB.87.184115 [DOI] [Google Scholar]
  • 29. Prodan E, Kohn W: Nearsightedness of electronic matter. Proc Natl Acad Sci U S A. 2005;102(33):11635–8. 10.1073/pnas.0505436102 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30. Caro MA: Optimizing many-body atomic descriptors for enhanced computational performance of machine learning based interatomic potentials. Phys Rev B. 2019;100(2):024112. 10.1103/PhysRevB.100.024112 [DOI] [Google Scholar]
  • 31. Kermode JR: QUIP. 2008. Reference Source [Google Scholar]
  • 32. Csányi G, Winfield S, Kermode JR, et al. : Expressive programming for computational physics in fortran 95+. IoP Comp Phys Newsletter. 2007. [Google Scholar]
  • 33. Kermode JR: f90wrap: an automated tool for constructing deep python interfaces to modern fortran codes. J Phys Condens Matter. 2020;32(30):305901. 10.1088/1361-648X/ab82d2 [DOI] [PubMed] [Google Scholar]
  • 34. Himanen L, Jäger MOJ, Morooka EV, et al. : DScribe: Library of descriptors for machine learning in materials science. Comput Phys Commun. 2020;247:106949. 10.1016/j.cpc.2019.106949 [DOI] [Google Scholar]
  • 35. Ceriotti M, Emsley M, Paruzzo F, et al. : Chemical shifts in molecular solids by machine learning datasets. Materials Cloud Archive. 2019. 10.24435/materialscloud:2019.0023/v2 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36. Goscinski A, Fraux G, Imbalzano G, et al. : The role of feature space in atomistic learning. Mach Learn Sci Technol. 2021;2(2):025028. 10.1088/2632-2153/abdaf7 [DOI] [Google Scholar]
  • 37. Goscinski A, Musil F, Pozdnyakov S, et al. : Optimal radial basis for density-based atomic representations. J Chem Phys. 2021;155(10):104106. 10.1063/5.0057229 [DOI] [PubMed] [Google Scholar]
  • 38. de Jong S, Kiers HAL: Principal covariates regression: Part I. Theory. Chemometr Intell Lab Syst. 1992;14(1–3):155–164. 10.1016/0169-7439(92)80100-I [DOI] [Google Scholar]
  • 39. Schölkopf B, Smola A, Müller KR: Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation. 1998;10(5):1299–1319. 10.1162/089976698300017467 [DOI] [Google Scholar]
  • 40. Cersonsky TEK, Cersonsky RK, Saade GR, et al. : Placental lesions associated with stillbirth by gestational age, according to feature importance: results from the Stillbirth Collaborative Research Network. Placenta. 2023. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41. Mahoney MW, Drineas P: CUR matrix decompositions for improved data analysis. Proc Natl Acad Sci U S A. 2009;106(3):697–702. 10.1073/pnas.0803205106 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42. Imbalzano G, Anelli A, Giofré D, et al. : automatic selection of atomic fingerprints and reference configurations for machine-learning potentials. J Chem Phys. 2018;148(24):241730. 10.1063/1.5024611 [DOI] [PubMed] [Google Scholar]
  • 43. Du Q, Faber V, Gunzburger M: Centroidal voronoi tessellations: Applications and algorithms. SIAM review. 1999;41(4):637–676. 10.1137/S0036144599352836 [DOI] [Google Scholar]
  • 44. da Costa-Luis C, Larroque SK, Altendorf K, et al. : tqdm: A fast, Extensible Progress Bar for Python and CLI. Zenodo. 2022. 10.5281/zenodo.7046742 [DOI] [Google Scholar]
  • 45. Mathers CD, Sadana R, Salomon JA, et al. : Healthy life expectancy in 191 countries, 1999. Lancet. 2001;357(9269):1685–1691. 10.1016/S0140-6736(00)04824-8 [DOI] [PubMed] [Google Scholar]
  • 46. Ashford LS: How HIV and AIDS affect populations. World. 2006;1:38–600. Reference Source [Google Scholar]
  • 47. Hansen CW: The relation between wealth and health: Evidence from a world panel of countries. Econ Lett. 2012;115(2):175–176. 10.1016/j.econlet.2011.12.031 [DOI] [Google Scholar]
  • 48. Anelli A, Engel EA, Pickard J, et al. : Generalized convex hull construction for materials discovery. Phys Rev Materials. 2018;2(10):103804. 10.1103/PhysRevMaterials.2.103804 [DOI] [Google Scholar]
  • 49. Ceriotti M, Tribello GA, Parrinello M: Simplifying the representation of complex free-energy land-scapes using sketch-map. Proceedings of the National Academy of Sciences. 2011;108(32):13023–13028. 10.1073/pnas.1108486108 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50. Virtanen P, Gommers R, Oliphant TE, et al. : SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nat Methods. 2020;17(3):261–272. 10.1038/s41592-019-0686-2 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 51. Barber CB, Dobkin DP, Huhdanpaa H: The quickhull algorithm for convex hulls. ACM Transactions on Mathematical Software (TOMS). 1996;22(4):469–483. 10.1145/235815.235821 [DOI] [Google Scholar]
  • 52. Liu W, Yuen SY, Chung KW, et al. : A general-purpose multi-dimensional convex landscape generator. Mathematics. 2022;10(21):3974. 10.3390/math10213974 [DOI] [Google Scholar]
  • 53. Anderson G, Crawford I, Leicester A: Efficiency analysis and the lower convex hull approach. Quantitative Approaches to Multidimensional Poverty Measurement. 2008;176–191. 10.1057/9780230582354_10 [DOI] [Google Scholar]
  • 54. De S, Bartók AP, Csányi G, et al. : Comparing molecules and solids across structural and alchemical space. Phys Chem Chem Phys. 2016;18(20):13754–13769. 10.1039/c6cp00415f [DOI] [PubMed] [Google Scholar]
  • 55. Fraux G, Cersonsky RK, Ceriotti M: Chemiscope: interactive structure-property explorer for materials and molecules. J Open Source Softw. 2020;5(51):2117. 10.21105/joss.02117 [DOI] [Google Scholar]
  • 56. Python package index - pypi. 2003. Reference Source [Google Scholar]
Open Res Eur. 2023 Jul 17. doi: 10.21956/openreseurope.17055.r31732

Reviewer response for version 1

Ekaterina I Izgorodina 1

The manuscript by Alexander Goscinski et al. “scikit-matter: A Suite of Generalisable Machine Learning Methods Born out of Chemistry and Materials Science” describes the implementation of four new features within scikit-matter that are applicable to the study of materials science. These features include feature reconstruction measures, Linear and Kernelized Principal Covariates Regression, Farthest Point Sampling, CUR, and their PCovR-Adaptations and Directional Convex Hull Construction. I enjoyed reading the manuscript and I think it has successfully showcased how new features can further improve the discovery of new materials. There are still several minor issues that must be addressed prior to indexing.

All subheadings are formatted equally which is rather confusing. For each which any subsection included within the feature section could be represented differently. For example, a smaller sized and italicised font could be used instead.

In terms of the first feature discussed in the manuscript – feature reconstruction measures – gives an example of how the number of basis functions used to represent neighbouring shells of water molecules could be much easier minimised by using this feature. The feature is based around the SOAP representation of a chemical system, in which each atom in the system is represented with a Gaussian density. I was wondering if a more general example could be given here. As the authors pointed out themselves, the SOAP representation is not well known outside of a very small community and therefore, this feature may not be of interest to a much broader community working on the discovery of new materials. Majority of scientists in the field use the smiles string representation to predict descriptors to be used in the discovery process. I was wondering if this feature could be used in combination with the smiles string to generate a density-based representation of the system.

Figure 5 is a little confusing. The authors argue that the combination of the PCovR kernel can distinguish among varying oxygen environments – hydroxide, hydronium cation and different arrangements of water. Can the three environments be elaborated on and accompanied by corresponding figures that showcase the differences in the oxygen environments? Equally important, it should be clearly stated how these environments were selected and why Mulliken charges were used for this purpose. I am a little hesitant about the wider applicability of Mulliken charges, as this is one of the most approximate schemes to predict partial atomic charges and strongly relies on the basis set used. The last point should be commented on in the manuscript.

Figure 8 is difficult to understand. It took me a while to realise that essentially the order of the parameters on the y-axis is given with respect to the best performing model proposed in the manuscript – Recursive Feature Selection. The “jagged” trends for other models are slightly confusing, as they clearly favoured a different order of the parameters. I was wondering if there was a better and clearer way to represent these data in a figure.

Is the rationale for developing the new software tool clearly explained?

Yes

Is the description of the software tool technically sound?

Yes

Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?

Yes

Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?

Yes

Reviewer Expertise:

Theoretical Chemistry, Molecular Dynamics and Machine Learning

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

Open Res Eur. 2023 Aug 25.
Alexander Goscinski 1

The manuscript by Alexander Goscinski et al. “scikit-matter: A Suite of Generalisable Machine Learning Methods Born out of Chemistry and Materials Science” describes the implementation of four new features within scikit-matter that are applicable to the study of materials science. These features include feature reconstruction measures, Linear and Kernelized Principal Covariates Regression, Farthest Point Sampling, CUR, and their PCovR-Adaptations and Directional Convex Hull Construction. I enjoyed reading the manuscript and I think it has successfully showcased how new features can further improve the discovery of new materials. There are still several minor issues that must be addressed prior to indexing.

We thank Prof. Izgorodina for her kind words, and are happy that she finds the publication useful!

All subheadings are formatted identically, which is rather confusing. Each subsection included within the feature section could be represented differently. For example, a smaller, italicised font could be used instead.

We changed the structure of the paper by adding numbering to the sections and by removing the methods section, so that all features move one level higher. For example, the methods “FPS” and “CUR” are now subsubsections, so that the paragraph “Workflow” can be clearly identified as part of the corresponding method.

The first feature discussed in the manuscript – feature reconstruction measures – is illustrated with an example of how the number of basis functions used to represent the neighbouring shells of water molecules can be minimised much more easily by using this feature. The feature is based around the SOAP representation of a chemical system, in which each atom in the system is represented with a Gaussian density. I was wondering if a more general example could be given here. As the authors pointed out themselves, the SOAP representation is not well known outside of a very small community and therefore, this feature may not be of interest to a much broader community working on the discovery of new materials. The majority of scientists in the field use the SMILES string representation to predict descriptors to be used in the discovery process. I was wondering if this feature could be used in combination with the SMILES string to generate a density-based representation of the system.

The main benefit of SMILES strings is to leverage advances in natural-language processing for chemical systems. While there are possibilities for scikit-matter to support NLP-like functionalities, scikit-learn is currently not the primary package used for NLP workflows, and therefore we wouldn’t necessarily argue that scikit-matter should first-and-foremost cater to NLP models. That being said, there are many ways to embed SMILES strings into numeric representations, and, in this context, we agree with the reviewer that we could make an effort to include the greater community in these case studies. We have added some notes to sections on how SMILES-like workflows could be augmented with these methods: Throughout the paper we use the Smooth Overlap of Atomic Positions (SOAP) representation to numerically encode our chemical data\cite{soap}. We note, however, that other numerical encodings of chemical information are compatible with the methods presented here\cite{capecchi_one_2020}. Unfortunately, SMILES strings often lack the configurational information needed to generate a density-based representation, and there is an active, non-trivial effort in the field to predict thermodynamically-sensible configurations from connectivity information (e.g. SMILES), see DOIs 10.1039/d0cp04670a and 10.48550/arXiv.2106.07802. We feel that in order to include a useful example based on SMILES-type descriptors we would have to focus on the methodology that translates them into continuous features, which is, in our view, a pre-processing step that is orthogonal to the methods currently included in the package.
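To make this concrete, the sketch below illustrates the kind of workflow we have in mind once SMILES strings have been embedded numerically. It is purely illustrative and not taken from the manuscript: we assume RDKit for a Morgan-fingerprint embedding, and the molecules, fingerprint settings, and number of selected features are arbitrary placeholders; any other embedding that yields fixed-length numerical vectors could be substituted for the fingerprint step.

```python
# Illustrative sketch only: SMILES -> fingerprint matrix -> skmatter selector.
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from skmatter.feature_selection import CUR


def morgan_features(smiles, n_bits=256):
    """Embed a SMILES string as a dense Morgan-fingerprint vector."""
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=n_bits)
    arr = np.zeros((n_bits,), dtype=float)
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr


smiles = ["CCO", "CC(=O)O", "c1ccccc1", "CCN", "CCOC", "CC(C)O", "CCC", "C1CCCCC1"]
X = np.vstack([morgan_features(s) for s in smiles])

# Once the chemistry is encoded numerically, any skmatter selector applies;
# here CUR picks the 4 most informative fingerprint bits.
selector = CUR(n_to_select=4).fit(X)
X_reduced = selector.transform(X)
print(X_reduced.shape)  # (8, 4)
```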

Figure 5 is a little confusing. The authors argue that the combination of the PCovR kernel can distinguish among varying oxygen environments – hydroxide, hydronium cation and different arrangements of water. Can the three environments be elaborated on and accompanied by corresponding figures that showcase the differences in the oxygen environments?

We thank Prof. Izgorodina for her comment – Figure 5 is meant to showcase the difference in chemical analyses between PCA and PCovR; PCovR can incorporate target information into structure-property maps, whereas PCA will be entirely chemically-determined. The importance here is not the environments themselves, but what we would consider “similar” environments, as visualization and mapping techniques often aim to represent similar data points near each other in latent space. When aiming to identify QSPR relationships, PCA is often invoked; however, PCovR is directly suited to this task, as mappings which best represent chemical similarity (such as here) sometimes do not reflect similarity in properties as well. We invite the reviewer to explore the difference in this relationship herself using the interactive demo https://chemiscope.org/?load=https://github.com/lab-cosmo/skmatter-ore/raw/main/paper/chemiscope-oxygen-ice-structures-small.json.gz
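As a purely illustrative sketch of this distinction (and not the workflow behind Figure 5), the snippet below contrasts PCA with PCovR on synthetic data; the random features and target stand in for the SOAP features and Mulliken charges, and the mixing value is an arbitrary choice.

```python
# Illustrative sketch: PCA versus PCovR latent spaces on synthetic data.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from skmatter.decomposition import PCovR

rng = np.random.default_rng(0)
X = StandardScaler().fit_transform(rng.normal(size=(100, 20)))
y = X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.normal(size=100)
y = (y - y.mean()) / y.std()  # standardized target

# PCA: the latent space is determined purely by feature variance.
T_pca = PCA(n_components=2).fit_transform(X)

# PCovR: the mixing parameter interpolates between a purely structural
# (PCA-like) and a purely property-driven (regression-like) projection.
pcovr = PCovR(mixing=0.5, n_components=2).fit(X, y)
T_pcovr = pcovr.transform(X)
print(T_pca.shape, T_pcovr.shape)
```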

Equally important, it should be clearly stated how these environments were selected and why Mulliken charges were used for this purpose. I am a little hesitant about the wider applicability of Mulliken charges, as this is one of the most approximate schemes to predict partial atomic charges and strongly relies on the basis set used. The last point should be commented on in the manuscript.

Here we use Mulliken charges not to make any larger scientific claim, but rather to provide a reasonable and easily-understandable target quantity for PCovR for a general audience. Please see our comments to Prof. Rosen on the exact computation and provenance of these quantities.  

Figure 8 is difficult to understand. It took me a while to realise that essentially the order of the parameters on the y-axis is given with respect to the best performing model proposed in the manuscript – Recursive Feature Selection. The “jagged” trends for other models are slightly confusing, as they clearly favoured a different order of the parameters. I was wondering if there was a better and clearer way to represent these data in a figure.

During the creation of the figure we tried different orderings of the y-axis and concluded that the order corresponding to the best regression performance is the clearest, as this gives an interpretable importance score for the features. As the red methods are based purely on unsupervised information, their order does not need to correlate with RFE, which results in the “jagged” trend. We explain this more precisely in the text: The relationship between the selectors becomes more apparent when the similarity is plotted using the weighted Kendall's $\tau$\cite{shieh_weighted_1998}, as illustrated in \figref{fig:who_sel}b). A strong similarity is observed between each selector and its PCovR adaptation (e.g. CUR and PCovCUR) as well as among the supervised selectors (PCovCUR, PCovFPS and RFA). While the supervised selectors incorporate the same target information to assess feature importance, the unsupervised selectors rely on different mathematical assumptions, resulting in lower similarity. On the other hand, the feature selection schemes that are based solely on the target information (blue) correlate quite strongly with the recursive feature selection. We updated the figure so that this can be seen much more easily. For that we changed the model to a linear model to remove the effect of nonlinearities in the feature selection process. In addition, we added a heatmap of the similarity between the features chosen by each selector, using the selection order of the corresponding method as rank. In this heatmap the correlation can be seen clearly.
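For readers who wish to reproduce this kind of comparison, the sketch below computes the weighted Kendall's τ between the orderings produced by two selectors on random data. It is a schematic example only: the attribute name selected_idx_ and the constructor arguments reflect our reading of the skmatter API at the time of writing and should be checked against the documentation.

```python
# Sketch: weighted Kendall's tau between two feature-selection orderings.
import numpy as np
from scipy.stats import weightedtau
from skmatter.feature_selection import CUR, FPS

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))
n_features = X.shape[1]


def importance_scores(selector, n_features):
    """Turn the order in which features were picked into scores, where the
    earliest-selected feature receives the highest score."""
    order = np.asarray(selector.selected_idx_)  # assumed attribute name
    scores = np.zeros(n_features)
    scores[order] = np.arange(len(order), 0, -1)
    return scores


cur = CUR(n_to_select=20).fit(X)
fps = FPS(n_to_select=20).fit(X)

tau, _ = weightedtau(importance_scores(cur, n_features),
                     importance_scores(fps, n_features))
print(f"weighted Kendall's tau between CUR and FPS orderings: {tau:.2f}")
```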

Open Res Eur. 2023 Jun 19. doi: 10.21956/openreseurope.17055.r31730

Reviewer response for version 1

Andrew S Rosen 1

# Overall Comments

In their submitted work, Goscinski et al. have described and released `scikit-matter`, a Python package built around the `scikit-learn` API that provides a suite of feature selection/analysis tools that are likely to be of particular interest to the computational materials science community. After going through the paper and corresponding code, I am confident that this work warrants acceptance in Open Research Europe. To the best of my knowledge, there is no comparable package published to date, and the code itself fills a niche between pure ML regression codes like `scikit-learn` and ML featurization codes like `QUIP` and `DScribe`.

In short, I recommend this work for acceptance in Open Research Europe with only minor revisions, mainly focused on making the documentation more accessible to a reader who may not (yet) be familiar with the underlying techniques and perhaps has not (yet) read the corresponding paper. I also have a few very minor comments and questions about the paper, as described below.

Congratulations to the authors on this nice work, and I look forward to seeing `scikit-matter` continue to evolve in the future.

# The Paper

## Overall Comments

Overall, I found the paper to be excellent. I learned a lot from it, and I thought it was exceptionally clear. I found the examples to be very insightful, such as the convergence test for the number of basis functions and the convex hull analysis broken down by the principal components. The paper (and corresponding code) strike a unique and admittedly challenging balance between being domain-agnostic yet also catering to methods that are likely to be especially useful to computational material scientists. I do not have any major comments on the paper, which I think will be a very useful reference for machine learning practitioners and students alike.

## Feature Reconstruction Measures

This section was very clear and effective! As someone who has not used feature reconstruction measures before, one of the questions I had when reading the text was the following: "for a given dataset of interest, wouldn't doing a standard evaluation of the model cross-validation error for two separate representations also tell me which is more information-rich?" With this in mind, perhaps it might be worth taking a sentence to explain the major benefit of the feature reconstruction method approach. Is it simply that it's a more direct measure of the ability of the representation to reproduce the true data?

The authors state that "As most interesting applications use large feature vectors, we implemented a custom RidgeRegression2FoldCV in the skmatter.linear_model module to improve computational efficiency." This immediately made me think about SOAP and how, particularly over a dataset spanning much of the periodic table in terms of unique elements, the feature vectors can be absolutely enormous. This often makes something like kernel ridge regression (and, therefore, the hyperparameter tuning process) very memory- and compute-intensive. Could a method like GFRE be used to narrow down the range of SOAP-based hyperparameters to search over by plotting a convergence curve like in Figure 2? Is that the primary use case?

## Principal Covariates Regression

I thought this section was great, and I don't have any major comments. The demonstration of how PCovR can be very beneficial in KRR (particularly with non-linear kernel functions) was very insightful.

## FPS, CUR, and Hulls

Again, I thought this final section was written very clearly. I have one extremely minor comment. On the bottom-left of Page 13, the text states "As all structures that sit above the hull energetically will decompose into a mixture of the more stable phases...". Technically, the convex hull is based solely on thermodynamics and therefore can only describe thermodynamic (in)stability, if I understand right. Therefore, if the kinetic barrier is high for the transition, it's not necessarily guaranteed that the structure will decompose (i.e. the structure could be meta-stable).

## Methods

- Could you please include the version of the DFTB+ code that was used to generate the Mulliken partial charges? Are there any other details that need to be included here for reproducibility purposes, such as chosen Slater-Koster files?

- What code was used to generate the features in the paper? In particular, I know that different codes have implemented SOAP differently, so for the sake of reproducibility, it would be beneficial to know any pertinent details (e.g. for reproducing Figure 2).

## Citations

The citations are thorough and appropriate. You may also wish to consider the following:

- You may wish to cite `scikit-learn` since `scikit-matter` is so heavily built around it. See https://scikit-learn.org/stable/about.html#citing-scikit-learn for the citation information.

- You may wish to cite Bartel, _J. Mater. Sci._ (2022) https://link.springer.com/article/10.1007/s10853-022-06915-4 in the convex hull section since most people aren't very familiar with convex hull analyses and the Bartel paper is extremely accessible to a broad audience in case the reader wishes to learn more.

# The Code

## GitHub Repo

Everything looks great here! I'm very happy to see the significant test coverage. Kudos! The Python package has all the ingredients for a robust and well-maintained code.

## Installation

Everything worked as expected!

That said, I do have one very minor suggestion. Although it's in the `pyproject.toml` file, less experienced Pythonistas may not be aware of what versions of Python are supported by `scikit-matter`. So, it might be worth either mentioning that in the docs or adding the associated [PyPI badge]( https://shields.io/) to the `README.md`.

## Documentation

Overall, I am very pleased by the documentation! My main suggestion has to do with the fact that most users will likely check out the code/docs before ever opening the paper, and right now the documentation isn't sufficient to explain to the user why this code might be for them. For instance, perhaps someone stumbled upon the code on GitHub after someone else starred it. Without reading the paper, they are going to see "PCov-FPS extends upon FPS much like PCov-CUR does to CUR" on the "What's in scikit-matter" page and likely be confused as to what this means and may doubt that this is a relevant package for them.

Thankfully, this is an easy fix in my opinion. The paper itself is extremely clear and has the key components for what I would also like to see in the docs: a brief overview of each technique and why it might be useful, along with a minimal working example for each. You can simply adapt snippets of text and/or figures from your paper straight into the docs and it'd likely be perfect. This would best be done in a new "Methods" or "Theory" section, perhaps mimicking the section for "Descriptors" in the [DScribe]( https://singroup.github.io/dscribe/latest/tutorials/tutorials.html#descriptors) package.

There are admittedly some details in the API Reference section already, but it's somewhat technical. The API Reference section is also really meant for current users of the code who want to look up the details of a particular function/class. It's not really meant for new users who are trying to figure out the methodology, which is why I think the Python package would be best served by having an additional section in the documentation dedicated to this.

Overall, I would say this is probably the most important comment I have about the submitted work that I hope to see addressed in a future (minor) revision.

### Examples

The examples in the documentation are great, and the ability to directly download the code as a Jupyter Notebook is very welcome. As a suggestion for further improvement, I think it would be especially helpful to provide a few sentences of context at the start of each example to orient the user about what the goal of the case-study is. Setting the stage a bit and defining the scope would go a long way, as it wasn't immediately clear to me what the goal was in each example until I completed it. Like above, I also think users who haven't read the paper or references therein likely won't know what PCovR means, for instance, and therefore won't immediately know which examples are relevant to their interests/needs. Providing some background to motivate things at a higher level would help close this gap.

### Contributing/References

Everything is very clear here in terms of guidelines! Same for the references. I have no suggestions here.

### Miscellaneous

In case the authors are curious, I made a fork of the repo and ran the codebase through several GitHub-based code analyzers, including [Deepsource]( https://deepsource.io), [Codacy]( https://app.codacy.com/), and [Sourcery](https://sourcery.ai/), to see if there were any glaring bugs or major issues. Good news --- I didn't see anything crucial! Most of the hits were stylistic in nature. This isn't too surprising given the high test coverage. Nice!

Is the rationale for developing the new software tool clearly explained?

Yes

Is the description of the software tool technically sound?

Yes

Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?

Yes

Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?

Yes

Reviewer Expertise:

Computational materials chemistry, quantum-chemical methods, machine learning for materials discovery.

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

References

  • 1. Bartel CJ: Review of computational approaches to predict the thermodynamic stability of inorganic solids. Journal of Materials Science. 2022;57(23):10475–10498. 10.1007/s10853-022-06915-4 [DOI] [Google Scholar]
Open Res Eur. 2023 Aug 25.
Alexander Goscinski 1

Answer to Andrew S. Rosen

# Overall Comments

In their submitted work, Goscinski et al. have described and released `scikit-matter`, a Python package built around the `scikit-learn` API that provides a suite of feature selection/analysis tools that are likely to be of particular interest to the computational materials science community. After going through the paper and corresponding code, I am confident that this work warrants acceptance in Open Research Europe. To the best of my knowledge, there is no comparable package published to date, and the code itself fills a niche between pure ML regression codes like `scikit-learn` and ML featurization codes like `QUIP` and `DScribe`.

In short, I recommend this work for acceptance in Open Research Europe with only minor revisions, mainly focused on making the documentation more accessible to a reader who may not (yet) be familiar with the underlying techniques and perhaps has not (yet) read the corresponding paper. I also have a few very minor comments and questions about the paper, as described below.

Congratulations to the authors on this nice work, and I look forward to seeing `scikit-matter` continue to evolve in the future.  

We thank Prof. Rosen for his kind words, and are happy that he finds the publication suitable for ORE.  

# The Paper

## Overall Comments

Overall, I found the paper to be excellent. I learned a lot from it, and I thought it was exceptionally clear. I found the examples to be very insightful, such as the convergence test for the number of basis functions and the convex hull analysis broken down by the principal components. The paper (and corresponding code) strike a unique and admittedly challenging balance between being domain-agnostic yet also catering to methods that are likely to be especially useful to computational material scientists. I do not have any major comments on the paper, which I think will be a very useful reference for machine learning practitioners and students alike.  

Again, we thank Prof. Rosen for his kind words.  

## Feature Reconstruction Measures

This section was very clear and effective! As someone who has not used feature reconstruction measures before, one of the questions I had when reading the text was the following: "for a given dataset of interest, wouldn't doing a standard evaluation of the model cross-validation error for two separate representations also tell me which is more information-rich?" With this in mind, perhaps it might be worth taking a sentence to explain the major benefit of the feature reconstruction method approach. Is it simply that it's a more direct measure of the ability of the representation to reproduce the true data?

It is correct that applying the feature reconstruction error (FRE) to a target property as ground truth is equivalent to a regression. One extension of the FRE is to use an arbitrary set of features as the ground truth, which makes it better suited to comparing features when quantitatively assessing different featurizations of the same representation (e.g. different radial basis functions). We have added the following to the manuscript to clarify this point: The feature set that results in the higher reconstructive measure is considered higher in information capacity (e.g. if FRM(ℱ, G) > FRM(G, ℱ), then ℱ is the more information-rich feature set). The advantage of the FRM is that it can give quantitative results about the shared information content between two sets of features, even without the existence of a ground truth. The simplest FRM is the global feature reconstruction error (GFRE), expressed as the linearly decodable information, as given by performing ridge regression between the two sets of features. We note that in the case of choosing the property as ground truth, the GFRE is equivalent to a regression task on the standardized property.
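As an illustration of this directional comparison, the sketch below evaluates the GFRE between two stand-in feature sets, one of which is a strict subset of the other. The function name and argument order follow our reading of skmatter.metrics and should be treated as an assumption.

```python
# Sketch: directional GFRE between two synthetic feature sets.
import numpy as np
from skmatter.metrics import global_reconstruction_error

rng = np.random.default_rng(0)
F = rng.normal(size=(300, 40))   # "large" feature set
G = F[:, :10]                    # "small" feature set, fully contained in F

# Reconstructing G from F should give a near-zero error, since F contains G;
# the reverse direction is larger, i.e. F is the more information-rich set.
print("GFRE(F -> G):", global_reconstruction_error(F, G))
print("GFRE(G -> F):", global_reconstruction_error(G, F))
```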

The authors state that "As most interesting applications use large feature vectors, we implemented a custom RidgeRegression2FoldCV in the skmatter.linear_model module to improve computational efficiency." This immediately made me think about SOAP and how, particularly over a dataset spanning much of the periodic table in terms of unique elements, the feature vectors can be absolutely enormous. This often makes something like kernel ridge regression (and, therefore, the hyperparameter tuning process) very memory- and compute-intensive. Could a method like GFRE be used to narrow down the range of SOAP-based hyperparameters to search over by plotting a convergence curve like in Figure 2? Is that the primary use case?  

Prof. Rosen is correct that this is one beneficial use case for GFRE; here, only the regularization parameter is optimized over a large grid, and no other hyperparameters. We have added the following statements to the manuscript: In \sklearn an implementation of leave-one-out CV exists with a similar purpose of speeding up the CV. A detailed comparison to this method can be found in one of the examples included in the documentation of \skmat \url{https://scikit-matter.readthedocs.io/en/v0.2.0/examples/regression/Ridge2FoldCVRegularization.html}. The linked page is a new sphinx-gallery example in the documentation with a more detailed analysis.
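For orientation, the sketch below places the two estimators side by side on a synthetic, very wide feature matrix. sklearn's RidgeCV (efficient leave-one-out CV by default) is used as documented, while the arguments we pass to RidgeRegression2FoldCV are assumptions based on our reading of the skmatter docs and may differ in detail.

```python
# Sketch: cross-validated ridge regression on a wide synthetic feature matrix.
import numpy as np
from sklearn.linear_model import RidgeCV
from skmatter.linear_model import RidgeRegression2FoldCV

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2000))          # many more features than samples
w = rng.normal(size=2000)
y = X @ w + 0.1 * rng.normal(size=500)

alphas = np.geomspace(1e-8, 1e2, 11)

# scikit-learn: efficient leave-one-out CV over the alpha grid.
loo_ridge = RidgeCV(alphas=alphas).fit(X, y)

# scikit-matter: 2-fold CV scheme designed to stay cheap for very wide
# feature matrices (the `alphas` argument is an assumed parameter name).
two_fold_ridge = RidgeRegression2FoldCV(alphas=alphas).fit(X, y)
```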

On the bottom-left of Page 13, the text states "As all structures that sit above the hull energetically will decompose into a mixture of the more stable phases...". Technically, the convex hull is based solely on thermodynamics and therefore can only describe thermodynamic (in)stability, if I understand right. Therefore, if the kinetic barrier is high for the transition, it's not necessarily guaranteed that the structure will decompose (i.e. the structure could be meta-stable).

It is true that kinetics can be an obstacle to spinodal decomposition – convex hulls are typically used to assess thermodynamic processes, as opposed to kinetic ones. We have added the phrase “From a thermodynamic standpoint,” at the beginning of this sentence to clarify the physics here.  

Could you please include the version of the DFTB+ code that was used to generate the Mulliken partial charges? Are there any other details that need to be included here for reproducibility purposes, such as chosen Slater-Koster files?

To the best of our knowledge, these charges were computed using DFTB+ version 21.2/22.1 and ase version 3.22.1. We apologize for the uncertainty, as an untimely computer crash has left some ambiguity in the DFTB+ versioning. DFTB+ changes between these two versions would have no impact on the ultimate computations. We used the 3ob Slater-Koster files downloaded from the dftbplus repository and the following code to generate all charges for a given frame from the ice dataset:

    from ase.units import Hartree, Bohr
    from ase.calculators.dftb import Dftb

    calc = Dftb(
        atoms=frame,
        label=directory + "/dftb".format(frame.info["name"]),
        Hamiltonian_MaxAngularMomentum_="",
        Hamiltonian_MaxAngularMomentum_O="p",
        Hamiltonian_MaxAngularMomentum_H="s",
        kpts=(1, 1, 1),
        Hamiltonian_SCC="Yes",
        Hamiltonian_MaxSCCIterations=1000,
        Hamiltonian_Charge=0.0,
        Hamiltonian_DampXH="Yes",
        Hamiltonian_DampXHExponent=4.0,
        Hamiltonian_HubbardDerivs_="",
        Hamiltonian_HubbardDerivs_O=-0.1575,
        Hamiltonian_HubbardDerivs_H=-0.1857,
        Hamiltonian_ThirdOrderFull="Yes",
    )
    frame.calc = calc
    calc.calculate(frame)
    frame.arrays["charges"] = frame.get_charges()

This code, as well as all other scripts used to generate figures in the paper, has been archived at https://github.com/lab-cosmo/skmatter-ore. This link has been added to the “Data Availability” section: Data are available under the terms of the \href{https://creativecommons.org/licenses/by/4.0/}{Creative Commons Attribution 4.0 International license} (CC-BY 4.0). The scripts, together with the required software packages used to generate the figures, are available in the repository https://github.com/lab-cosmo/skmatter-ore in the form of executed jupyter notebooks with output.

What code was used to generate the features in the paper? In particular, I know that different codes have implemented SOAP differently, so for the sake of reproducibility, it would be beneficial to know any pertinent details (e.g. for reproducing Figure 2).  

We used librascal (github.com/lab-cosmo/librascal, commit 4e8f9fb) to compute the SOAP features. We included in the manuscript a link to the repository that contains the notebooks used for the computation of all figures. The specific hypers for each figure can be found in the corresponding notebooks; the hypers for Figure 2 are in the following notebook: https://github.com/lab-cosmo/skmatter-ore/blob/main/paper/atomrings.ipynb

You may wish to cite `scikit-learn` since `scikit-matter` is so heavily built around it. See https://scikit-learn.org/stable/about.html#citing-scikit-learn for the citation information.

Thank you for the advice, a needed citation hidden-in-plain-sight!

You may wish to cite Bartel, J. Mater. Sci. (2022) https://link.springer.com/article/10.1007/s10853-022-06915-4 in the convex hull section since most people aren't very familiar with convex hull analyses and the Bartel paper is extremely accessible to a broad audience in case the reader wishes to learn more.

Thanks for the suggestion! We have been sure to add in this appropriate citation.  

That said, I do have one very minor suggestion. Although it's in the `pyproject.toml` file, less experienced Pythonistas may not be aware of what versions of Python are supported by `scikit-matter`. So, it might be worth either mentioning that in the docs or adding the associated [PyPI badge]( https://shields.io/) to the `README.md`.  

We updated the metadata of our package, and it is now visible on PyPI. We also specify which operating systems and Python versions the CI runs on (the suggestion resulted in the PR https://github.com/scikit-learn-contrib/scikit-matter/pull/200).

The paper itself is extremely clear and has the key components for what I would also like to see in the docs: a brief overview of each technique and why it might be useful, along with a minimal working example for each. You can simply adapt snippets of text and/or figures from your paper straight into the docs and it'd likely be perfect. This would best be done in a new "Methods" or "Theory" section, perhaps mimicking the section for "Descriptors" in the [DScribe]( https://singroup.github.io/dscribe/latest/tutorials/tutorials.html#descriptors) package.

We thank Prof. Rosen for the nice comment and have made appropriate changes to the documentation by adding a “Getting started” section, see https://scikit-matter.readthedocs.io/en/latest/getting-started.html   

I think it would be especially helpful to provide a few sentences of context at the start of each example to orient the user about what the goal of the case-study is. Setting the stage a bit and defining the scope would go a long way, as it wasn't immediately clear to me what the goal was in each example until I completed it. Like above, I also think users who haven't read the paper or references therein likely won't know what PCovR means, for instance, and therefore won't immediately know which examples are relevant to their interests/needs. Providing some background to motivate things at a higher level would help close this gap.

Thanks for the comment; we hope that with the documentation overhaul ( https://scikit-matter.readthedocs.io/en/latest/index.html ) we meet Prof. Rosen's criteria. We put a lot of effort into making the documentation friendlier to less experienced users.

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Data Availability Statement

    Materials Cloud: Mapping uncharted territory in ice from zeolite networks to ice structures. DOI: 10.24435/materialscloud:2018.0010/v1. Engel et al. 16

    The project employs the following underlying data:

    • dataset.xyz: Structures in XYZ format for 15,869 geometry-optimised ice structures. Optimisation was carried out using PBE-DFT in the CASTEP code, with a plane-wave energy cut-off of 490 eV, a maximum k-point spacing of 2π × 0.07 Å⁻¹, and on-the-fly generated ultrasoft pseudopotentials.

    • properties.dat: Properties of all ice structures in dataset.xyz, including the CSD identifier, number of atoms per unit cell, density, configurational energy, and energy with respect to the energy-density convex hull.

    We also employ publicly available datasets provided by The World Bank:

    • Life expectancy at birth, total (years), World Bank 18 .

    • Population, total, World Bank 19 .

    • GDP per capita (current USD), World Bank 20 .

    • Current health expenditure (percentage of GDP), World Bank 21 .

    • Government expenditure on education, total (percentage of GDP) 22 .

    • Prevalence of HIV, total (percentage of population ages 15–49) World Bank 23 .

    • Incidence of tuberculosis (per 100,000 people) World Bank 24 .

    • Immunization, measles (percentage of children ages 12–23 months), World Bank 25 .

    • Immunization, DPT (percentage of children ages 12–23 months) World Bank 26 .

    • Prevalence of undernourishment (percentage of population) World Bank 27 .

    The aggregated WHO dataset can be retrieved through skmatter.datasets.load_who_dataset.
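    A minimal sketch of loading this dataset is shown below; we assume the loader follows scikit-learn's dataset conventions, so the exact fields of the returned object should be checked against the skmatter documentation.

```python
# Sketch: loading the aggregated WHO dataset shipped with skmatter.
from skmatter.datasets import load_who_dataset

who = load_who_dataset()
# Assumption: the returned object is Bunch-like and exposes the table as `.data`.
table = getattr(who, "data", who)
print(type(who), getattr(table, "shape", None))
```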

    Data are available under the terms of the Creative Commons Attribution 4.0 International license (CC-BY 4.0).

