Simple and interpretable data-driven descriptor accurately predicts the synthesizability of single and double perovskites.
Abstract
Predicting the stability of the perovskite structure remains a long-standing challenge for the discovery of new functional materials for many applications including photovoltaics and electrocatalysts. We developed an accurate, physically interpretable, and one-dimensional tolerance factor, τ, that correctly predicts 92% of compounds as perovskite or nonperovskite for an experimental dataset of 576 ABX3 materials (X = O2−, F−, Cl−, Br−, I−) using a novel data analytics approach based on SISSO (sure independence screening and sparsifying operator). τ is shown to generalize outside the training set for 1034 experimentally realized single and double perovskites (91% accuracy) and is applied to identify 23,314 new double perovskites (A2BB′X6) ranked by their probability of being stable as perovskite. This work guides experimentalists and theorists toward which perovskites are most likely to be successfully synthesized and demonstrates an approach to descriptor identification that can be extended to arbitrary applications beyond perovskite stability predictions.
INTRODUCTION
Crystal structure prediction from chemical composition continues as a persistent challenge to accelerated materials discovery (1, 2). Most approaches capable of addressing this challenge require several computationally demanding electronic-structure calculations for each material composition, limiting their use to a small set of materials (3–6). Alternatively, descriptor-based approaches enable high-throughput screening applications because they provide rapid estimates of material properties (7, 8). Notably, the Goldschmidt tolerance factor, t (9), has been used extensively to predict the stability of the perovskite structure based only on the chemical formula, ABX3, and the ionic radii, ri, of each ion (A, B, X)
$$t = \frac{r_A + r_X}{\sqrt{2}\,(r_B + r_X)} \tag{1}$$
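As a minimal sketch of how t is evaluated in practice, the snippet below assumes the standard Goldschmidt form, t = (rA + rX)/[√2 (rB + rX)]. The function name `goldschmidt_t` is illustrative, and the SrTiO3 radii (in Å) are approximate Shannon values quoted here as an assumption rather than taken from this work.

```python
import math

def goldschmidt_t(r_a: float, r_b: float, r_x: float) -> float:
    """Goldschmidt tolerance factor from the ionic radii of A, B, and X."""
    return (r_a + r_x) / (math.sqrt(2) * (r_b + r_x))

# Assumed approximate Shannon radii (angstrom): Sr2+ (CN 12), Ti4+ (CN 6), O2- (CN 6).
t_srtio3 = goldschmidt_t(r_a=1.44, r_b=0.605, r_x=1.40)
print(f"t(SrTiO3) = {t_srtio3:.3f}")
```

A value of t near 1 corresponds to the ideal geometric fit of the cubic structure.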
The perovskite crystal structure, as shown in Fig. 1A, is defined as any ABX3 compound with a network of corner-sharing BX6 octahedra surrounding a larger A-site cation (rA > rB), where the cations, A and B, can span the periodic table and the anion, X, is typically a chalcogen or halogen. Distortions from the cubic structure can arise from size mismatch of the cations and anion, which results in additional perovskite structures and nonperovskite structures. The B cation can also be replaced by two different ions, resulting in the double perovskite formula, A2BB′X6 (Fig. 1B). Single and double perovskite materials have exceptional properties for a variety of applications such as electrocatalysis (10), proton conduction (11), ferroelectrics (12) (using oxides, X = O2−), battery materials (13) (using fluorides, X = F−), as well as photovoltaics (14) and optoelectronics (15) (using the heavier halides, X = Cl−, Br−, I−).
The first step in designing new perovskites for these applications is typically the assessment of stability using t, which has informed the design of perovskites for over 90 years. However, as reported in recent studies, its accuracy is often insufficient (16). Considering 576 ABX3 solids experimentally characterized at ambient conditions and reported in (17–19) (see Fig. 1C for the A, B, and X elements in this set), t correctly distinguishes between perovskite and nonperovskite for only 74% of materials and performs considerably worse for compounds containing heavier halides [chlorides (51% accuracy), bromides (56%), and iodides (33%)] than for oxides (83%) and fluorides (83%) (Fig. 2A, fig. S1, and table S1). This deficiency in generalization to halide perovskites severely limits the applicability of t for materials discovery.
In this work, we present a new tolerance factor (τ), which has the form
$$\tau = \frac{r_X}{r_B} - n_A\left(n_A - \frac{r_A/r_B}{\ln(r_A/r_B)}\right) \tag{2}$$
where nA is the oxidation state of A, ri is the ionic radius of ion i, rA > rB by definition, and τ < 4.18 indicates perovskite. A high overall accuracy of 92% for the experimental set (94% for a randomly chosen test set of 116 compounds) and nearly uniform performance across the five anions evaluated [oxides (92% accuracy), fluorides (92%), chlorides (90%), bromides (93%), and iodides (91%)] is achieved with τ (Fig. 2B, fig. S1, and table S1). Like t, the prediction of perovskite stability using τ requires only the chemical composition, allowing the tolerance factor to be agnostic to the many structures that are considered perovskite. In addition to predicting if a material is stable as perovskite, τ also provides a monotonic estimate of the probability that a material is stable in the perovskite structure. The accurate and probabilistic nature of τ, as well as its generalizability over a broad range of single and double perovskites, allows new physical insights into the stability of the perovskite structure and the prediction of thousands of new double perovskite oxides and halides, 23,314 of which are provided here and ranked by their probability of being stable in the perovskite structure.
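The evaluation of τ can be sketched in a few lines, assuming the form τ = rX/rB − nA[nA − (rA/rB)/ln(rA/rB)], which is consistent with the octahedral term, the x/ln(x) term, and the −nA² term discussed in the text. The function names and the radii (approximate Shannon values, in Å) are assumptions made for illustration.

```python
import math

TAU_THRESHOLD = 4.18  # from the text: tau < 4.18 indicates perovskite

def tau(r_a: float, r_b: float, r_x: float, n_a: int) -> float:
    """Tolerance factor tau; requires r_a > r_b so that ln(r_a/r_b) > 0."""
    if r_a <= r_b:
        raise ValueError("tau is defined only for r_a > r_b")
    ratio = r_a / r_b
    return r_x / r_b - n_a * (n_a - ratio / math.log(ratio))

def predicts_perovskite(r_a: float, r_b: float, r_x: float, n_a: int) -> bool:
    """Binary classification using the empirical threshold."""
    return tau(r_a, r_b, r_x, n_a) < TAU_THRESHOLD

# Assumed approximate Shannon radii (angstrom).
examples = {
    "LaAlO3": dict(r_a=1.36, r_b=0.535, r_x=1.40, n_a=3),
    "NaBeCl3": dict(r_a=1.39, r_b=0.45, r_x=1.81, n_a=1),
}
for name, args in examples.items():
    print(name, round(tau(**args), 2), predicts_perovskite(**args))
```

With these assumed radii, LaAlO3 falls below the threshold and NaBeCl3 above it, in line with the experimental labels cited in the text.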
RESULTS AND DISCUSSION
Finding an improved tolerance factor to predict perovskite stability
One key aspect of the performance of t is how well the sum of ionic radii estimates the interatomic bond distances for a given structure. Shannon’s revised effective ionic radii (20) based on a systematic empirical assessment of interatomic distances in nearly 1000 compounds are the typical choice for radii because they provide ionic radius as a function of ion, oxidation state, and coordination number for the majority of elements. Most efforts to improve t have focused on refining the input radii (17, 19, 21, 22) or increasing the dimensionality of the descriptor through two-dimensional (2D) structure maps (18, 23, 24) or high-dimensional machine-learned models (25–27). However, all hitherto applied approaches for improving the Goldschmidt tolerance factor are only effective over a limited range of ABX3 compositions. Despite its modest classification accuracy, t remains the primary descriptor used by experimentalists and theorists to predict the stability of perovskites.
The SISSO (sure independence screening and sparsifying operator) approach (28) was used to identify an improved tolerance factor for predicting whether a given compound is perovskite [determined by experimental realization of any structure with corner-sharing BX6 octahedra (21) at ambient conditions] or nonperovskite [determined by experimental realization of any structure(s) without corner-sharing BX6 octahedra, including, in some cases, failed synthesis of any ABX3 compound]. Of the 576 experimentally characterized ABX3 solids, 80% were used to train and 20% were used to test the SISSO-learned descriptor. Several alternative atomic properties were considered as candidate features, and among them, SISSO determined that the best performing descriptor, τ (Eq. 2 and Fig. 2B), depends only on oxidation states and Shannon ionic radii (see Materials and Methods for an explanation of the approach used for descriptor identification and a discussion of alternative approaches). For the set of 576 ABX3 compositions, τ correctly labels 94% of the perovskites and 89% of the nonperovskites compared with 94 and 49%, respectively, using t. The primary advantage of τ over t is the remarkable reduction in compounds that are predicted to be perovskite but are not experimentally identified as stable perovskites, with false-positive rates for τ and t of 11 and 51%, respectively. Full confusion matrices along with additional performance metrics for τ and t are provided in table S2. The large decrease in false-positive rate (from 51% to 11%) while substantially increasing the overall classification accuracy (from 74% to 92%) demonstrates that τ improves significantly upon t as a reliable tool to guide experimentalists toward which compounds can be synthesized in perovskite structures.
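The performance figures quoted above (overall accuracy, fraction of perovskites and nonperovskites labeled correctly, and false-positive rate) all follow from a confusion matrix. The sketch below shows one way to compute them; the function name and the toy labels are hypothetical and are not the data from table S2.

```python
def classification_metrics(y_true, y_pred):
    """Metrics for binary labels where 1 = perovskite and 0 = nonperovskite."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "perovskite_recall": tp / (tp + fn),
        "nonperovskite_recall": tn / (tn + fp),
        # Fraction of true nonperovskites that are predicted to be perovskite.
        "false_positive_rate": fp / (fp + tn),
    }

# Toy labels for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1]
print(classification_metrics(y_true, y_pred))
```

Under this definition, the false-positive rate is the complement of the fraction of nonperovskites labeled correctly, which matches the 89% and 11% pairing reported for τ.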
Beyond the improved accuracy, a crucial advantage of τ is the monotonic (continuous) dependence of perovskite stability on τ. As τ decreases, the τ-based probability of being perovskite, P(τ), increases, where perovskites are expected for an empirically determined range of τ < 4.18 (Fig. 2B; Materials and Methods for details). Probabilities are obtained using Platt’s scaling (29), where the binary classification of perovskite/nonperovskite is transformed into a continuous probability estimate of perovskite stability, P(τ), by training a logistic regression model on the τ-derived binary classification. Probabilities cannot similarly be obtained with t because the stability of the perovskite structure does not increase or decrease monotonically with t, where 0.825 < t < 1.059 results in a classification as perovskite (this range maximizes the classification accuracy of t on the set of 576 compounds). While P(τ) is sigmoidal with respect to τ because of the logistic fit (fig. S2), a bell-shaped behavior of P(τ) with respect to t is observed because of the multiple decision boundaries required for t (Fig. 2C). This relationship leads to an increase in P(τ) (i.e., probability of perovskite stability using τ), with an increase in t until a value of t ~ 0.9. Beyond this range, the probabilities level out or decrease as t increases further.
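To make the probability construction concrete, the following is a rough sketch of a Platt-style fit: a one-variable logistic model P = 1/(1 + exp(a·s + b)) is fit to binary labels, with s = τ − 4.18. The τ values and labels are invented for illustration, and the plain gradient-descent fit is an assumption; it is not the fitting procedure or data used in this work.

```python
import math

def fit_platt(scores, labels, lr=0.1, n_iter=5000):
    """Fit P(y=1 | s) = 1 / (1 + exp(a*s + b)) by gradient descent on the log loss."""
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(n_iter):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(a * s + b))
            # For this parameterization, d(loss)/d(a*s + b) = y - p.
            grad_a += (y - p) * s
            grad_b += (y - p)
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

# Hypothetical tau values and labels (1 = perovskite); not data from the paper.
taus = [1.8, 2.5, 3.0, 3.6, 3.9, 4.3, 4.0, 4.5, 5.0, 5.8, 7.0, 9.0]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
a, b = fit_platt([t - 4.18 for t in taus], labels)

def p_perovskite(tau_value: float) -> float:
    """Sigmoidal probability that decreases monotonically with tau when a > 0."""
    return 1.0 / (1.0 + math.exp(a * (tau_value - 4.18) + b))

for tv in (2.0, 4.18, 6.0):
    print(tv, round(p_perovskite(tv), 3))
```

Because the model has a single input and a single slope, the resulting probability is monotonic in τ, which is the property that a two-sided window in t cannot provide.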
The disparity between the τ-derived perovskite probability, P(τ), and the assignment by t can be significant, especially in the range where t predicts a stable perovskite (0.825 < t < 1.059). A comparison of the perovskite (LaAlO3) and the nonperovskite (NaBeCl3) illustrates the discrepancy between these two approaches. t incorrectly predicts both compounds to be perovskite (t = 1.0), whereas P(τ) varies from <10% for NaBeCl3 to >97% for LaAlO3, in agreement with the experimental results. For NaBeCl3, instability in the perovskite structure arises from an insufficiently large Be2+ cation on the B site, which leads to unstable BeCl6 octahedra. This contribution to perovskite stability is accounted for in the first term of τ (Eq. 2, rX/rB = μ−1, where μ is the octahedral factor).
μ is the typical choice for a second feature used in combination with t (18, 19, 23) and was recently used to assess the predictive accuracy of Goldschmidt’s “no-rattling” principle. In this analysis, six inequalities dependent on t and μ were derived and used to predict the formability of single and double perovskites with a reported accuracy of ~80% (30). Notably, training a decision tree algorithm on the bounds of t and μ that optimally separate perovskite from nonperovskite leads to a classification accuracy of 85% for this dataset (fig. S3). In contrast to these 2D descriptors based on (t, μ), τ incorporates μ as a 1D descriptor yet still achieves a higher accuracy of 92%, demonstrating the capability of the SISSO algorithm to identify a highly accurate tolerance factor composed of intuitively meaningful parameters.
The nature of geometrical descriptors, such as t or μ, is fundamentally different than that of data-driven descriptors, such as τ. t and μ are derived from geometric constraints that indicate when the perovskite structure is a possible structure that can form. However, these constraints do not necessarily indicate when the perovskite structure is the ground-state structure and does form. For instance, if t = 1 and the ionic limit on which t was derived is applicable (the interatomic distances are sums of the ionic radii), these criteria do not suggest that perovskite is the ground-state structure, only that the interatomic distances are such that the lattice constants in the A-X and B-X directions can be commensurate with the perovskite structure. The fact that t does not guarantee the formation of the perovskite structure is evident by the high false-positive rate (51%) in the region of t where perovskite is expected (0.825 < t < 1.059). Similarly, although μ may fall within the range where BX6 octahedra are expected based on geometric considerations (0.414 < μ < 0.732), the octahedra that form may be edge or face sharing, and therefore, the observed structure is nonperovskite. In this work, SISSO searches a massive space of potential descriptors to identify the one that most successfully detects when a given chemical formula will or will not crystallize in the perovskite structure, and because this is the target property, τ emerges as a much more predictive descriptor than t or μ.
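The geometric windows quoted above can be checked directly. The sketch below evaluates t and μ = rB/rX against the ranges stated in the text (0.825 < t < 1.059 and 0.414 < μ < 0.732); the function name and the radii (approximate Shannon values, in Å) are assumptions.

```python
import math

T_RANGE = (0.825, 1.059)   # window in t quoted in the text
MU_RANGE = (0.414, 0.732)  # window in mu quoted in the text

def geometric_checks(r_a: float, r_b: float, r_x: float) -> dict:
    """Evaluate t and mu and report whether each lies inside its window."""
    t = (r_a + r_x) / (math.sqrt(2) * (r_b + r_x))
    mu = r_b / r_x
    return {
        "t": t,
        "mu": mu,
        "t_ok": T_RANGE[0] < t < T_RANGE[1],
        "mu_ok": MU_RANGE[0] < mu < MU_RANGE[1],
    }

# Assumed approximate Shannon radii (angstrom).
print("SrTiO3 ", geometric_checks(1.44, 0.605, 1.40))
print("NaBeCl3", geometric_checks(1.39, 0.45, 1.81))
```

With these radii, NaBeCl3 passes the t window but fails the μ window, illustrating why t alone produces false positives. Passing both windows is still only a geometric possibility check, not a prediction that perovskite is the ground state.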
Although the classification by τ disagrees with the experimental label for 8% of the 576 compounds, the agreement increases to 99% outside the range 3.31 < τ < 5.92 (200 compounds) and 100% outside the range 3.31 < τ < 12.08 (152 compounds). The experimental dataset may also be imperfect as compounds can manifest different crystal structures as a function of the synthesis conditions due to, e.g., defects in the experimental samples (impurities, vacancies, etc.). These considerations emphasize the usefulness of τ-derived probabilities, in addition to the binary classification of perovskite/nonperovskite, which address these uncertainties in the experimental data and corresponding classification by τ.
Comparing τ to calculated perovskite stabilities
The precise and probabilistic nature of τ, as well as its simple functional form, which depends only on widely available Shannon radii (and the oxidation states required to determine the radii), enables the rapid search across composition space for stable perovskite materials. Before attempting synthesis, it is common for new materials to be examined using computational approaches; therefore, it is useful to compare the predictions from τ with those obtained using density functional theory (DFT). The stabilities (decomposition enthalpies, ΔHd) of 73 single and double perovskite chalcogenides and halides were recently examined with DFT using the Perdew-Burke-Ernzerhof (31) exchange-correlation functional (32, 33). τ is found to agree with the calculated stability for 64 of 73 calculated materials. Importantly, the probabilities that result from classification with τ linearly correlate with ΔHd, demonstrating the value of the monotonic behavior of τ and P(τ) (Fig. 2D and table S3).
Although τ appears to disagree with these DFT calculations for nine compounds, six disagreements lie near the decision boundaries [P(τ) = 0.5, ΔHd = 0 meV/atom], suggesting that they cannot be confidently classified as stable or unstable perovskites using τ or DFT calculations of the cubic structure. Of the remaining disagreements, CaZrO3 and CaHfO3 reveal the power of τ compared with DFT calculations of the cubic structure, as these two oxides are known to be isostructural with the orthorhombic perovskite CaTiO3, from which the name perovskite originates (34, 35). ΔHd < −90 meV/atom for these two compounds in the cubic structure, indicating that they are nonperovskites. In contrast, τ predicts both compounds to be stable perovskites with ~65% probability, which agrees with the experimental results. These results show that a key challenge in the prediction of perovskite stability from quantum chemical calculations is the requirement of a specific structure as an input, as there are more than a dozen unique structures classified as perovskite (i.e., those having corner-sharing BX6 octahedra) and many more that are nonperovskite.
Several recent machine-learned descriptors for perovskite stability have been trained or tested on DFT-calculated stabilities of only the cubic perovskite structure (33, 36–38). However, less than 10% of perovskites are observed experimentally in this structure (21), leading to an inherent disagreement between the descriptor predictions and experimental observations. Recently, it was shown that of 254 synthesized perovskite oxides (ABO3), DFT calculations in the Open Quantum Materials Database (39) predict only 186 (70%) to be stable or even moderately unstable (within 100 meV/atom of the convex hull) (27). The discrepancy is likely associated with the difference in energy between the true perovskite ground state and the calculated high-symmetry structure(s). Because τ was trained exclusively on the experimental characterization of ABX3 compounds, τ is informed by the true ground-state (or metastable but observed) structure of each ABX3 and the potential for these compounds to decompose into any compound(s) in the A-B-X composition space. A principal advantage of τ over many existing descriptors is that its identification and validation were based on experimentally observed stability or instability of a structurally diverse dataset.
Extension to double perovskite oxides and halides
Double perovskites are particularly intriguing as an emerging class of semiconductors that offer a lead-free alternative to traditional perovskite photoabsorbers and an increased compositional tunability for enhancing desired properties such as catalytic activity (10, 16, 40). Still, the experimentally realized composition space of double perovskites is relatively unexplored compared with the number of possible A, B, B′, and X combinations that can form A2BB′X6 compounds. The set of 576 compounds used for training and testing τ is composed of 49 A cations, 67 B cations, and 5 X anions, from which >500,000 double perovskite formulas, A2BB′X6, can be constructed. Comparison with the Inorganic Crystal Structure Database (ICSD) (30, 41) reveals only 918 compounds (<0.2%) with known crystal structures, 868 of which are perovskite.
Although τ was only trained on ABX3 compounds, it is readily adaptable to double perovskites because it depends only on composition and not structure. To extend τ to A2BB′X6 formulas, rB is approximated as the arithmetic mean of the two B-site radii (rB, rB′). τ correctly classifies 91% of these 918 A2BB′X6 compounds in the ICSD (compared with 92% on 576 ABX3 compounds), recovering 806 of 868 known double perovskites (table S4). The geometric mean has also been used to approximate the radius of a site with two ions (42). We find that this has little effect on classification with τ, as 91% of the 918 A2BB′X6 compounds are also correctly classified using the geometric mean for rB, and the classification label differs for only 14 of 918 compounds using the arithmetic or geometric mean. Although τ was identified using 460 ABX3 compounds, the agreement with experiment on these compounds (92%) is comparable to that on the 1034 compounds (91%) that span ABX3 (116 compounds) and A2BB′X6 (918 compounds) formulas and were completely excluded from the development of τ (i.e., test set compounds). This result indicates pronounced generalizability to predicting experimental realization for single and double perovskites that are yet to be discovered. With τ thoroughly validated as being predictive of experimental stability, the space of yet-undiscovered double perovskites was explored to identify 23,314 charge-balanced double perovskites that τ predicts to be stable at ambient conditions (of >500,000 candidates). These compounds are provided in table S4 including assigned oxidation states and radii along with t and τ, predictions made using each tolerance factor, and classification in the ICSD where available. There are thousands of additional compounds with substitutions on the A and/or X sites, AA′BB′(XX′)3, that are expected to be similarly rich in yet-undiscovered perovskite compounds.
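The extension to A2BB′X6 amounts to replacing rB with an average of the two B-site radii. The sketch below compares the arithmetic and geometric means for Cs2AgBiCl6, assuming the form of τ given by τ = rX/rB − nA[nA − (rA/rB)/ln(rA/rB)] and approximate Shannon radii (in Å); the helper names are illustrative.

```python
import math

def tau(r_a: float, r_b: float, r_x: float, n_a: int) -> float:
    """Tolerance factor tau (assumed form); requires r_a > r_b."""
    ratio = r_a / r_b
    return r_x / r_b - n_a * (n_a - ratio / math.log(ratio))

def effective_r_b(r_b1: float, r_b2: float, mean: str = "arithmetic") -> float:
    """Approximate the B-site radius of a double perovskite from its two B cations."""
    if mean == "arithmetic":
        return 0.5 * (r_b1 + r_b2)
    if mean == "geometric":
        return math.sqrt(r_b1 * r_b2)
    raise ValueError("mean must be 'arithmetic' or 'geometric'")

# Assumed approximate Shannon radii (angstrom): Cs+ (CN 12), Ag+ and Bi3+ (CN 6), Cl-.
r_cs, r_ag, r_bi, r_cl = 1.88, 1.15, 1.03, 1.81
tau_arith = tau(r_cs, effective_r_b(r_ag, r_bi, "arithmetic"), r_cl, n_a=1)
tau_geom = tau(r_cs, effective_r_b(r_ag, r_bi, "geometric"), r_cl, n_a=1)
print(round(tau_arith, 3), round(tau_geom, 3))
```

For B cations of similar size, the two means are nearly identical, which is consistent with the small number of compounds whose label changes between them.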
Two particularly attractive classes of materials within this set of A2BB′X6 compounds are double perovskites with A = Cs+, X = Cl− and A = La3+, X = O2−, which have garnered substantial interest in a number of applications including photovoltaics, electrocatalysis, and ferroelectricity. The ICSD contains 45 compounds (42 perovskites) with the formula Cs2BB′Cl6, 43 of which are correctly classified as perovskite or nonperovskite by τ. From the high-throughput analysis using τ, we predict an additional 420 perovskites to be stable with 164 having at least the probability of perovskite formation as the recently synthesized perovskite, Cs2AgBiCl6 [P(τ) = 69.6%] (43). A map of perovskite probabilities for charge-balanced Cs2BB′Cl6 compounds is shown in Fig. 3 (lower triangle). Within this set of 164 probable perovskites, there is an opportunity to synthesize double perovskite chlorides that contain 3d transition metals substituted on one or both B sites, as 83 new compounds of this type are predicted to be stable as perovskite with high probability.
While double perovskite oxides have been explored extensively for a number of applications, the small radius and favorable charge of O2− yields a massive design space for the discovery of new compounds. For La2BB′O6, ~63% of candidate compositions are found to be charge-balanced compared with only ~24% of candidate Cs2BB′Cl6 compounds. The ICSD contains 85 La2BB′O6 compounds, all of which are predicted to be perovskite by τ in agreement with the experiment. We predict an additional 1128 perovskites to be discoverable in this space, with a remarkable 990 having P(τ) ≥ 85% (Fig. 3, upper triangle). All 128 ABX3 compounds in the experimental set that meet this threshold are experimentally realized as perovskite, suggesting that there is ample opportunity for perovskite discovery in lanthanum oxides.
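The charge-balance constraint behind these counts can be illustrated with a short enumeration: for A2BB′X6, the B-site oxidation states must satisfy 2nA + nB + nB′ = 6|nX|. The sketch below lists the unordered (nB, nB′) pairs that satisfy this, using a generic range of oxidation states as an assumption; the element-specific allowed states used in this work are not reproduced here.

```python
def charge_balanced_pairs(n_a: int, n_x: int, b_states=range(1, 8)):
    """Unordered (n_B, n_B') pairs that charge-balance A2BB'X6.

    n_x is the (negative) anion charge; b_states is an assumed generic
    set of allowed B-site oxidation states.
    """
    target = -6 * n_x - 2 * n_a  # required n_B + n_B'
    return [
        (nb, target - nb)
        for nb in b_states
        if nb <= target - nb and (target - nb) in b_states
    ]

print("Cs2BB'Cl6:", charge_balanced_pairs(n_a=1, n_x=-1))
print("La2BB'O6: ", charge_balanced_pairs(n_a=3, n_x=-2))
```

The larger required B-site charge in the oxide case admits more oxidation-state combinations, which is one way to see why a larger fraction of La2BB′O6 candidates is charge-balanced.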
Compositional mapping of perovskite stability
In addition to enabling the rapid exploration of stoichiometric perovskite compositions, τ provides the probability of perovskite stability, P(τ), for an arbitrary combination of nA, rA, rB, and rX, which is shown in Fig. 4. For each grouping shown in Fig. 4, experimentally realized perovskites and nonperovskites are shown as single points to compare with the range of values in the predictions made from τ. Doping at various concentrations presents a nearly infinite number of A1−xA′xB1−yB′y(X1−zX′z)3 compositions that allows the tuning of technologically useful properties. τ suggests the size and concentration of dopants on the A, B, or X sites that likely lead to improved stability in the perovskite structure. Conversely, compounds that lie in the high-probability region are likely amenable to ionic substitutions that decrease the probability of forming a perovskite but may improve a desired property for another application. For example, LaCoO3, with P(τ) = 98.9%, should accommodate reasonable ionic substitutions (i.e., A sites of comparable size to La or B sites of comparable size to Co) and was recently shown to have enhanced oxygen exchange capacity and nitric oxide oxidation kinetics with stable substitutions of Sr on the A site (44).
The probability maps in Fig. 4 arise from the functional form of τ (Eq. 2) and provide insights into the stability of the perovskite structure as the size of each ion is varied. The perovskite structure requires that the A and B cations occupy distinct sites in the ABX3 lattice, with A 12-fold and B 6-fold coordinated by X. When rA and rB are too similar, nonperovskite lattices that have similarly coordinated A and B sites, such as cubic bixbyite, become preferred over the perovskite structure. On the basis of the construct of τ, as rA/rB → 1, P(τ) → 0, which arises from the +x/ln(x) (x = rA/rB) term, where x/ln(x) → +∞ as x → 1 and larger values of τ lead to lower probabilities of forming perovskites. When rA = rB, τ is undefined, yet compounds where A and B have identical radii are rare and not expected to adopt perovskite structures (t = 0.71).
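The behavior of the x/ln(x) term can be checked numerically. The short sketch below (the function name is illustrative) evaluates the term as x = rA/rB approaches 1 from above and also at x = e, where elementary calculus gives its minimum.

```python
import math

def ratio_term(x: float) -> float:
    """The x/ln(x) term of tau, with x = r_A / r_B > 1."""
    if x <= 1.0:
        raise ValueError("x must be greater than 1")
    return x / math.log(x)

# The term grows without bound as the cation radii become similar.
for x in (2.0, 1.5, 1.1, 1.01):
    print(f"x = {x:<5} x/ln(x) = {ratio_term(x):.2f}")

# d/dx [x/ln(x)] = (ln(x) - 1) / ln(x)**2 vanishes at x = e, where the term equals e.
print(f"minimum value: {ratio_term(math.e):.4f}")
```

Since τ grows with this term, similar cation radii push τ well above the 4.18 threshold regardless of the other inputs.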
The octahedral term in τ (rX/rB) also manifests itself in the probability maps, particularly in the lower bound on rB where perovskites are expected as rX is varied. As rX increases, rB must similarly increase to enable the formation of stable BX6 octahedra. This effect is noticeable when separately comparing compounds containing Cl− (left), Br− (center), and I− (right) (bottom row of Fig. 4), where the range of allowed cation radii decreases as the anion radius increases. For rB << rX, rX/rB becomes large, which increases τ and therefore decreases the probability of stability in the perovskite structure. This accounts for the inability of small B-site ions to sufficiently separate X anions in BX6 octahedra, where geometric arguments suggest that B is sufficiently large to form BX6 octahedra only for rB/rX > 0.414. Because the cation radii ratios strongly affect the probability of perovskite, as discussed in the context of x/ln(x), rX also has a noticeable indirect effect on the lower bound of rA, which increases as rX increases.
The role of nA in τ is more difficult to parse, but its placement dictates two effects on stability: as A is more oxidized (increasing nA), the −nA^2 term increases the probability of forming the perovskite structure, but nA also magnifies the effect of the x/ln(x) term, increasing the importance of the cation radii ratio. Notably, nA = 1 for most halides and some oxides (245 of the 576 compounds in our set), and in these cases, τ = rX/rB − 1 + (rA/rB)/ln(rA/rB) for all combinations of A, B, and X, so nA plays no role as the composition is varied.
This analysis illustrates how data-driven approaches not only can be used to maximize the predictive accuracy of new descriptors but also can be leveraged to understand the actuating mechanisms of a target property—in this case, perovskite stability. This attribute distinguishes τ from other descriptors for perovskite stability that have emerged in recent years. For instance, three recent works have shown that the experimental formability of perovskite oxides and halides can be separately predicted with high accuracy using kernel support vector machines (26), gradient boosted decision trees (25), or a random forest of decision trees (27). While these approaches can yield highly accurate models, the resulting descriptors are not documented analytically, and therefore, the mechanism by which they make the perovskite/nonperovskite classification is opaque.
CONCLUSIONS
We report a new tolerance factor, τ, that enables the prediction of experimentally observed perovskite stability significantly better than the widely used Goldschmidt tolerance factor, t, and the 2D structure map using t and the octahedral factor, μ. For 576 ABX3 and 918 A2BB′X6 compounds, the prediction by τ agrees with the experimentally observed stability for >90% of compounds, with >1000 of these compounds reserved for testing generalizability (prediction accuracy). The deficiency of t arises from its functional form and not the input features, as the calculation of τ requires the same inputs as t (composition, oxidation states, and Shannon ionic radii). Thus, τ enables a superior prediction of perovskite stability with negligible computational cost. The monotonic and 1D nature of τ allows the determination of perovskite probability as a continuous function of the radii and oxidation states of A, B, and X. These probabilities are shown to linearly correlate with DFT-computed decomposition enthalpies and help clarify how chemical substitutions at each of the sites modulate the tendency for perovskite formation. Using τ, we predict the probability of double perovskite formation for thousands of unexplored compounds, resulting in a library of stable perovskites ordered by their likelihood of forming perovskites. Because of the simplicity and accuracy of τ, we expect its use to accelerate the discovery and design of state-of-the-art perovskite materials for applications ranging from photovoltaics to electrocatalysis.
MATERIALS AND METHODS
Radii assignment
To develop a descriptor that takes as input the chemical composition and outputs a prediction of perovskite stability, the features that comprise the descriptor must also be based only on composition. However, it is not known a priori which cation will occupy the A or B site given only a chemical composition, CC′X3 (C and C′ being cations). Therefore, we developed a systematic method for determining which cation is A or B to enable τ to be applied to an arbitrary new material. First, a list of allowed oxidation states is defined for each cation based on Shannon’s radii (20). All pairs of oxidation states for C and C′ that charge-balance X3 are considered. If more than one charge-balanced pair exists, a single pair is chosen on the basis of the electronegativity ratio of the two cations (χC/χC′). If 0.9 < χC/χC′ < 1.1, the pair that minimizes |nC – nC′| is chosen, where nC is the oxidation state for C. Otherwise, the pair that maximizes |nC – nC′| is chosen. With the oxidation states of C and C′ assigned, the Shannon radii for the cations occupying the A and B sites are taken at the coordination numbers closest to 12 and 6, respectively, which are consistent with the coordination environments of the A and B cations in the perovskite structure. Last, the radii of the C and C′ cations are compared, and the larger cation is assigned as the A-site cation. This strategy reproduced the assignment of the A and B cations for 100% of 313 experimentally labeled perovskites.
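The assignment procedure described above can be sketched as follows. The function names, the inputs, and the tie-breaking behavior of `min`/`max` are assumptions made for illustration; the lookup of Shannon radii by coordination number is omitted, and radii are passed in directly.

```python
def choose_oxidation_states(states_c, states_c2, n_x, chi_c, chi_c2):
    """Pick one charge-balanced (n_C, n_C') pair for CC'X3, or None if none exists."""
    target = -3 * n_x  # total cation charge needed to balance X3
    pairs = [(a, b) for a in states_c for b in states_c2 if a + b == target]
    if not pairs:
        return None
    if len(pairs) == 1:
        return pairs[0]
    ratio = chi_c / chi_c2
    if 0.9 < ratio < 1.1:
        # Similar electronegativities: prefer similar oxidation states.
        return min(pairs, key=lambda p: abs(p[0] - p[1]))
    # Dissimilar electronegativities: prefer the most different oxidation states.
    return max(pairs, key=lambda p: abs(p[0] - p[1]))

def assign_sites(cation_1, cation_2):
    """Each cation is a (name, radius) tuple; the larger cation takes the A site."""
    a, b = sorted((cation_1, cation_2), key=lambda c: c[1], reverse=True)
    return {"A": a[0], "B": b[0]}

# Hypothetical inputs for illustration.
print(choose_oxidation_states([1, 2, 3], [3, 4, 5], n_x=-2, chi_c=1.0, chi_c2=1.0))
print(choose_oxidation_states([1, 2, 3], [3, 4, 5], n_x=-2, chi_c=0.8, chi_c2=1.6))
print(assign_sites(("Ti", 0.605), ("Sr", 1.44)))
```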
Selection of τ
For the identification of τ among the offered candidates, the oxidation states (nA, nB, nX), ionic radii (rA, rB, rX), and radii ratios (rA/rB, rA/rX, rB/rX) comprise the primary features, Φ0, where Φn refers to the descriptor space with n iterations of complexity as defined in (28). For example, Φ1 refers to the primary features (Φ0), together with one iteration of algebraic/functional operations applied to each feature in Φ0. Φ2 then refers to the application of algebraic/functional operations to all potential descriptors in Φ1, and so forth. Note that Φm contains all potential descriptors within Φn<m, with a filter to remove redundant potential descriptors. For the discovery of τ, complexity up to Φ3 is considered, yielding ~3 × 10^9 potential descriptors. An alternative would be to exclude the radii ratios from Φ0 and construct potential descriptors with complexity up to Φ4. However, given the minimal Φ0 = [nA, nB, nX, rA, rB, rX], there are ~10^8 potential descriptors in Φ3, so ~10^16 potential descriptors would be expected in Φ4 (based on ~10^2 being present in Φ1 and ~1 × 10^4 in Φ2), and this number is impractical to screen using available computing resources.
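To illustrate why the descriptor space grows so quickly, the sketch below applies one iteration of a small, assumed operator set (square, logarithm, square root, sum, product, and both quotients) to the minimal primary features. It is not the operator set or implementation of SISSO (28), and it ignores the unit-consistency and redundancy filters that a real feature construction would apply; the feature values are assumed for a single compound.

```python
import itertools
import math

def expand(features: dict) -> dict:
    """Return the input features plus one iteration of unary and binary operations."""
    new = dict(features)
    for name, v in features.items():
        new[f"({name})^2"] = v ** 2
        if v > 0:  # ln and sqrt are applied only to positive values
            new[f"ln({name})"] = math.log(v)
            new[f"sqrt({name})"] = math.sqrt(v)
    for (n1, v1), (n2, v2) in itertools.combinations(features.items(), 2):
        new[f"({n1}+{n2})"] = v1 + v2
        new[f"({n1}*{n2})"] = v1 * v2
        if v2 != 0:
            new[f"({n1}/{n2})"] = v1 / v2
        if v1 != 0:
            new[f"({n2}/{n1})"] = v2 / v1
    return new

# Assumed values for one compound (SrTiO3-like), minimal primary feature set.
phi0 = {"n_A": 2, "n_B": 4, "n_X": -2, "r_A": 1.44, "r_B": 0.605, "r_X": 1.40}
phi1 = expand(phi0)
phi2 = expand(phi1)
print(len(phi0), len(phi1), len(phi2))
```

Even this reduced operator set takes six primary features to the order of 10^2 candidates after one iteration and to the order of 10^4 after two, in line with the scaling described in the text.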
The dataset of 576 ABX3 compositions was partitioned randomly into an 80% training set for identifying candidate descriptors and a 20% test set for analyzing the predictive ability of each descriptor. The top 100,000 potential descriptors most applicable to the perovskite classification problem were identified using one iteration of SISSO with a subspace size of 100,000. Each descriptor in the set of ~3 × 10^9 was ranked according to domain overlap, as described by Ouyang et al. (28). To identify a decision boundary for classification, a decision tree classifier with a maximum depth of two was fit to the top 100,000 candidate descriptors ranked based on domain overlap. Domain overlap (and not decision tree performance) was used as the SISSO ranking metric because of the much lower computational expense associated with applying this metric. Notably, τ was the 14,467th highest ranked descriptor by SISSO using the domain overlap metric, and hence, this defines the minimum subspace required to identify τ using this approach. Without evaluating a decision tree model for each descriptor in the set of ~3 × 10^9 potential descriptors, we cannot be certain that a subspace size of 100,000 is sufficient to find the best descriptor. However, the identification of τ within a subspace as small as 15,000 suggests that a subspace size of 100,000 is sufficiently large to efficiently screen the much larger descriptor space. We have also conducted a test on this primary feature space (Φ0 = [nA, nB, nX, rA, rB, rX, rA/rB, rA/rX, rB/rX]) with a subspace size of 500,000. Even after increasing the subspace size by 5×, τ remains the highest performing descriptor (a classification accuracy of 92% on the 576-compound set). An important distinction between the SISSO approach described here and by Ouyang et al. (28) is the choice of sparsifying operator (SO). In this work, domain overlap was used to rank the features in SISSO, but a decision tree with a maximum depth of two was used as the SO (instead of domain overlap) to identify the best descriptor of those selected by SISSO. This alternative SO was used to decrease the leverage of individual data points, as the experimental labeling of perovskite/nonperovskite is prone to some ambiguity based on synthesis conditions, defects, and other experimental considerations.
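As a rough illustration of ranking by domain overlap, the sketch below counts how many data points fall in the interval where the value ranges of the two classes overlap for a 1D descriptor and ranks candidates by that count. This is one simple reading of the metric, stated here as an assumption; the precise definition is given by Ouyang et al. (28), and the descriptor names and values are invented.

```python
def domain_overlap_count(values, labels):
    """Number of points inside the overlap of the two classes' value ranges."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    lo = max(min(pos), min(neg))
    hi = min(max(pos), max(neg))
    if lo > hi:  # the two ranges do not overlap at all
        return 0
    return sum(1 for v in values if lo <= v <= hi)

# Hypothetical candidate descriptors evaluated on six compounds (1 = perovskite).
labels = [1, 1, 1, 0, 0, 0]
candidates = {
    "d1": [1.0, 2.0, 3.0, 6.0, 7.0, 8.0],  # classes fully separated
    "d2": [1.0, 5.0, 3.0, 2.0, 7.0, 8.0],  # classes partially mixed
}
ranked = sorted(candidates, key=lambda k: domain_overlap_count(candidates[k], labels))
print(ranked)
```

A count like this needs only minima, maxima, and one pass over the data, which is why it is far cheaper than fitting a decision tree to each of billions of candidates.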
The benefit of including the radii ratios in Φ0 was made clear by comparing the performance of τ to the best descriptor obtained using the minimal primary feature space with Φ0 = [nA, nB, nX, rA, rB, rX]. Repeating the procedure used to identify τ yields a Φ3 with ~1 × 10^8 potential descriptors. The best 1D descriptor was found to be , with a classification accuracy of 89%.
Alternative features
We also considered the effects of including properties outside of those required to compute t or τ. We began with Φ0 = [nA, nB, nX, rA, rB, rX, rcov,A, rcov,B, rcov,X, IEA, IEB, IEX, χA, χB, χX], where rcov,i is the empirical covalent radius of neutral element i, IEi is the empirical first ionization energy of neutral element i, and χi is the Pauling electronegativity of element i. All values were taken from WebElements (45), which aggregates a number of references that are listed therein. Repeating the procedure used to identify τ results in ~6 × 10^10 potential descriptors in Φ3. The best performing 1D descriptor was found to be with a classification accuracy of 90%, which is lower than that of τ, even though τ makes use of only the oxidation states and ionic radii, and only slightly higher than the accuracy of the descriptor obtained using the minimal feature set.
Increasing dimensionality
To assess the performance of descriptors with increased dimensionality, following the approach to higher dimensional descriptor identification using SISSO described in (28), the residuals from classification by τ (those misclassified by the decision tree, Fig. 2B) were used as the target property in the search for a second dimension to include with τ. From the same set of ~3 × 10^9 potential descriptors constructed to identify τ, the 100,000 1D descriptors that best classify the 41 training set compounds misclassified by τ were identified on the basis of domain overlap. Each of these 100,000 descriptors was paired with τ, and the performance of each 2D descriptor was assessed using a decision tree with a maximum depth of two. The best performing 2D descriptor was found to be , with a classification accuracy of 95% on the 576-compound set. Improvements are expected to diminish as the dimensionality increases further due to the iterative nature of SISSO and the higher-order residuals used for subspace selection. Although the second dimension leads to slightly improved classification performance on the experimental set compared with τ, the simplicity and monotonicity of τ, which enables physical interpretation and the extraction of meaningful probabilities, support its selection instead of the more complex 2D descriptor. The benefits and capabilities of having a meaningfully probabilistic 1D tolerance factor, such as τ, are described in detail within the main text.
Potential for overfitting
The SISSO algorithm as implemented here selects τ from a space of ~3 × 10^9 candidate descriptors, and the only parameter that is fit is the optimum value of τ that defines the decision boundary for classification as perovskite or nonperovskite, τ = 4.18. This decision boundary was optimized using a decision tree to maximize the classification accuracy on the training set of 460 compounds. In this case, Gini impurity was minimized to optimize the decision boundary, but alternative cost functions based on Kullback-Leibler divergence or classification accuracy (e.g., l2) would find the same decision boundary. The SISSO descriptor identification is done from billions of candidates, but these functions comprise a discrete set; i.e., they form a basis in a high-dimensional space, where the number of training points is the dimensionality of the space, and this space is not densely covered by the functions. Therefore, the selection of only one function, τ, cannot overfit the data. If, however, some physical mechanism determining the stability of perovskites is not represented in the training set, it might be missed by the learned formula (here, τ), and the generalizability of the model would be hampered. Nonetheless, the 94% accuracy achieved by τ on the excluded set of 116 compounds shows that τ can generalize outside of the training data.
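The single fitted parameter discussed above is a threshold on a 1D descriptor. As a minimal sketch of how such a threshold can be chosen by minimizing Gini impurity, the following pure-Python search scans the midpoints between sorted descriptor values. The function names and the toy τ values and labels are hypothetical, and the scikit-learn decision tree used in the paper is replaced here by an explicit loop.

```python
def gini(labels):
    """Gini impurity of a list of binary labels (1 = perovskite)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)

def best_threshold(values, labels):
    """Return (threshold, weighted Gini impurity) of the best single split."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_score, best_cut = float("inf"), None
    for i in range(1, n):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no midpoint exists between identical values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best_score:
            best_score = score
            best_cut = 0.5 * (pairs[i - 1][0] + pairs[i][0])
    return best_cut, best_score

# Hypothetical descriptor values; one label is deliberately "noisy".
tau_values = [3.2, 3.6, 3.9, 4.1, 4.3, 4.6, 5.0, 5.5]
labels = [1, 1, 1, 1, 0, 1, 0, 0]
threshold, impurity = best_threshold(tau_values, labels)
```

For this toy set the split falls between 4.1 and 4.3 despite the one mislabeled point, which illustrates why a single-threshold model has limited capacity to fit individual data points.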
Alternative radii for more covalent compounds
Ionic radii are required inputs for τ (and t), and although the Shannon effective ionic radii are ubiquitous in solid-state materials research, a new set of B2+ radii was recently proposed for 18 cations to account for how their effective cationic radii vary as a function of increased covalency with the heavier halides (19). These revised radii apply to 129 of the 576 experimentally characterized compounds compiled in this dataset (62% of halides). For these 129 compounds, using the revised radii lowers the classification accuracy of τ by 5 percentage points, from 91% with the Shannon radii to 86%. The application of τ using Shannon radii for presumably covalent compounds was further validated by noting that τ correctly classifies 37 of 40 compounds that contain Sn or Pb and achieves an accuracy of 91% for 141 compounds with X = Cl−, Br−, or I−. In addition to the higher accuracy achieved by τ when using Shannon radii, we note that the Shannon radii are more comprehensive than the revised radii in (19), applying to more ions, oxidation states, and coordination environments, and are thus recommended for the calculation of τ.
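To make the role of the radii concrete, the sketch below evaluates τ for a single composition. The expression used, τ = rX/rB − nA[nA − (rA/rB)/ln(rA/rB)], is the form given in the main text of the paper rather than in this section, and the decision boundary of 4.18 is the value quoted above. The function names are hypothetical, and the SrTiO3 radii are Shannon values (Sr2+ in 12-fold coordination, Ti4+ and O2− in 6-fold coordination) chosen for illustration.

```python
import math

TAU_BOUNDARY = 4.18  # decision boundary quoted in the text

def tau(n_A, r_A, r_B, r_X):
    """Tolerance factor tau from the A-site oxidation state and ionic radii.

    Assumes r_A > r_B so that ln(r_A / r_B) is positive.
    """
    ratio = r_A / r_B
    return r_X / r_B - n_A * (n_A - ratio / math.log(ratio))

def is_predicted_perovskite(tau_value, boundary=TAU_BOUNDARY):
    """Compounds with tau below the boundary are classified as perovskite."""
    return tau_value < boundary

# SrTiO3 with Shannon radii in angstroms (illustrative inputs).
tau_srtio3 = tau(n_A=2, r_A=1.44, r_B=0.605, r_X=1.40)
srtio3_is_perovskite = is_predicted_perovskite(tau_srtio3)
```

Swapping in a different radius set, such as the revised B2+ radii discussed above, only changes the r_B argument, so the effect of the radius choice on the classification can be checked one compound at a time.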
Computer packages used
SISSO was performed using an implementation written in Fortran 90. Platt scaling (29) was used to extract classification probabilities for τ by fitting a logistic regression model on the decision tree classifications using threefold cross-validation. Decision tree fitting and Platt scaling were performed within the Python package scikit-learn. Data visualizations were generated within the Python packages Matplotlib and Seaborn.
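The idea behind Platt scaling can be sketched in a few lines: a logistic sigmoid is fit to a classifier score so that the score maps to a probability. The example below is a simplified stand-in and not the scikit-learn routine used in the paper. It takes the signed distance from the decision boundary (4.18 − τ) as the score, fits the two sigmoid parameters by plain gradient descent, and omits the threefold cross-validation. All names and the toy data are hypothetical.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_platt(scores, labels, lr=0.5, n_iter=2000):
    """Fit P(y=1 | s) = sigmoid(a * s + b) by gradient descent on log loss."""
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(n_iter):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

def perovskite_probability(tau_value, a, b, boundary=4.18):
    """Probability of being perovskite as a function of tau."""
    return sigmoid(a * (boundary - tau_value) + b)

# Hypothetical tau values and labels (1 = perovskite), not data from the paper.
tau_values = [3.2, 3.6, 3.9, 4.1, 4.3, 4.6, 5.0, 5.5]
labels = [1, 1, 1, 1, 0, 1, 0, 0]
scores = [4.18 - t for t in tau_values]
a, b = fit_platt(scores, labels)
p_low = perovskite_probability(3.5, a, b)
p_high = perovskite_probability(5.0, a, b)
```

Because the fitted slope is positive for data like these, the resulting probability decreases monotonically with τ, which is the property that lets a 1D descriptor be read as a ranking of candidate compounds.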
Acknowledgments
We thank A. Holder for helpful discussions regarding the manuscript. Funding: This project has received funding from the European Union’s Horizon 2020 research and innovation program (#676580: The NOMAD Laboratory—A European Center of Excellence and #740233: TEC1p), the Berlin Big-Data Center (BBDC, #01IS14013E), and BiGmax, the Max Planck Society’s Research Network on Big-Data-Driven Materials-Science. C.J.B. acknowledges support from a U.S. Department of Education Graduate Assistantship in Areas of National Need. C.S. acknowledges funding by the Alexander von Humboldt Foundation. C.B.M. acknowledges support from NSF award CBET-1433521, which was cosponsored by the NSF and the U.S. Department of Energy (DOE), Office of Energy Efficiency and Renewable Energy (EERE), Fuel Cell Technologies Office and from DOE award EERE DE-EE0008088. Part of this research was performed using computational resources sponsored by the U.S. DOE, Office of EERE and located at the National Renewable Energy Laboratory. Author contributions: M.S. and C.J.B. conceived the idea. C.J.B., C.S., and B.R.G. designed the studies. C.J.B. performed the studies. C.J.B., C.S., and B.R.G. analyzed the results and wrote the manuscript. R.O. provided the SISSO algorithm and facilitated its implementation. C.B.M., L.M.G., and M.S. supervised the project. All the authors discussed the results and implications and edited the manuscript. Competing interests: The authors declare that they have no competing financial interests. Data and materials availability: A repository containing all files necessary for classifying ABX3 and AA′BB′(XX′)3 compositions as perovskite or nonperovskite using τ is available at https://github.com/CJBartel/perovskite-stability. A graphical interface allowing users to classify compounds with τ is also available at https://analytics-toolkit.nomad-coe.eu. The classification of all compounds shown in the manuscript is available in the Supplementary Materials. 
All data needed to evaluate the conclusions in the paper are present in the paper and/or the Supplementary Materials. Additional data related to this paper may be requested from the authors.
REFERENCES AND NOTES
- 1. Pauling L., The principles determining the structure of complex ionic crystals. J. Am. Chem. Soc. 51, 1010–1026 (1929).
- 2. Woodley S. M., Catlow R., Crystal structure prediction from first principles. Nat. Mater. 7, 937–946 (2008).
- 3. Kirkpatrick S., Gelatt C. D. Jr., Vecchi M. P., Optimization by simulated annealing. Science 220, 671–680 (1983).
- 4. Doye J. P. K., Wales D. J., Thermodynamics of global optimization. Phys. Rev. Lett. 80, 1357–1360 (1998).
- 5. Goedecker S., Minima hopping: An efficient search method for the global minimum of the potential energy surface of complex molecular systems. J. Chem. Phys. 120, 9911–9917 (2004).
- 6. Oganov A. R., Lyakhov A. O., Valle M., How evolutionary crystal structure prediction works—And why. Acc. Chem. Res. 44, 227–237 (2011).
- 7. Curtarolo S., Hart G. L. W., Buongiorno Nardelli M., Mingo N., Sanvito S., Levy O., The high-throughput highway to computational materials design. Nat. Mater. 12, 191–201 (2013).
- 8. Ghiringhelli L. M., Vybiral J., Levchenko S. V., Draxl C., Scheffler M., Big data of materials science: Critical role of the descriptor. Phys. Rev. Lett. 114, 105503 (2015).
- 9. Goldschmidt V. M., Die gesetze der krystallochemie. Naturwissenschaften 14, 477–485 (1926).
- 10. Hwang J., Rao R. R., Giordano L., Katayama Y., Yu Y., Shao-Horn Y., Perovskites in catalysis and electrocatalysis. Science 358, 751–756 (2017).
- 11. Duan C., Tong J., Shang M., Nikodemski S., Sanders M., Ricote S., Almansoori A., O’Hayre R., Readily processed protonic ceramic fuel cells with high performance at low temperatures. Science 349, 1321–1326 (2015).
- 12. Cohen R. E., Origin of ferroelectricity in perovskite oxides. Nature 358, 136–138 (1992).
- 13. Yi T., Chen W., Cheng L., Bayliss R. D., Lin F., Plews M. R., Nordlund D., Doeff M. M., Persson K. A., Cabana J., Investigating the intercalation chemistry of alkali ions in fluoride perovskites. Chem. Mater. 29, 1561–1568 (2017).
- 14. Correa-Baena J.-P., Saliba M., Buonassisi T., Grätzel M., Abate A., Tress W., Hagfeldt A., Promises and challenges of perovskite solar cells. Science 358, 739–744 (2017).
- 15. Kovalenko M. V., Protesescu L., Bodnarchuk M. I., Properties and potential optoelectronic applications of lead halide perovskite nanocrystals. Science 358, 745–750 (2017).
- 16. Li W., Wang Z., Deschler F., Gao S., Friend R. H., Cheetham A. K., Chemically diverse and multifunctional hybrid organic–inorganic perovskites. Nat. Rev. Mater. 2, 16099 (2017).
- 17. Zhang H., Li N., Li K., Xue D., Structural stability and formability of ABO3-type perovskite compounds. Acta Crystallogr. B 63, 812–818 (2007).
- 18. Li C., Lu X., Ding W., Feng L., Gao Y., Guo Z., Formability of ABX3 (X = F, Cl, Br, I) halide perovskites. Acta Crystallogr. B 64, 702–707 (2008).
- 19. Travis W., Glover E. N. K., Bronstein H., Scanlon D. O., Palgrave R. G., On the application of the tolerance factor to inorganic and hybrid halide perovskites: A revised system. Chem. Sci. 7, 4548–4556 (2016).
- 20. Shannon R. D., Revised effective ionic radii and systematic studies of interatomic distances in halides and chalcogenides. Acta Crystallogr. A 32, 751–767 (1976).
- 21. Lufaso M. W., Woodward P. M., Prediction of the crystal structures of perovskites using the software program SPuDS. Acta Crystallogr. B 57, 725–738 (2001).
- 22. Kieslich G., Sun S., Cheetham A. K., Solid-state principles applied to organic–inorganic perovskites: New tricks for an old dog. Chem. Sci. 5, 4712–4715 (2014).
- 23. Li C., Soh K. C. K., Wu P., Formability of ABO3 perovskites. J. Alloys Compd. 372, 40–48 (2004).
- 24. Becker M., Klüner T., Wark M., Formation of hybrid ABX3 perovskite compounds for solar cell application: First-principles calculations of effective ionic radii and determination of tolerance factors. Dalton Trans. 46, 3500–3509 (2017).
- 25. Pilania G., Balachandran P. V., Gubernatis J. E., Lookman T., Classification of ABO3 perovskite solids: A machine learning study. Acta Crystallogr. B 71, 507–513 (2015).
- 26. Pilania G., Balachandran P. V., Kim C., Lookman T., Finding new perovskite halides via machine learning. Front. Mater. 3, 19 (2016).
- 27. Balachandran P. V., Emery A. A., Gubernatis J. E., Lookman T., Wolverton C., Zunger A., Predictions of new ABO3 perovskite compounds by combining machine learning and density functional theory. Phys. Rev. Mater. 2, 043802 (2018).
- 28. Ouyang R., Curtarolo S., Ahmetcik E., Scheffler M., Ghiringhelli L. M., SISSO: A compressed-sensing method for identifying the best low-dimensional descriptor in an immensity of offered candidates. Phys. Rev. Mater. 2, 083802 (2018).
- 29. Platt J. C., Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods, in Advances in Large Margin Classifiers, A. J. Smola, P. B. Bartlett, B. Schölkopf, D. Schuurmans, Eds. (MIT Press, 1999), vol. 10, pp. 61–74.
- 30. Filip M. R., Giustino F., The geometric blueprint of perovskites. Proc. Natl. Acad. Sci. U.S.A. 115, 5397–5402 (2018).
- 31. Perdew J. P., Burke K., Ernzerhof M., Generalized gradient approximation made simple. Phys. Rev. Lett. 77, 3865–3868 (1996).
- 32. Zhao X.-G., Yang D., Sun Y., Li T., Zhang L., Yu L., Zunger A., Cu–In halide perovskite solar absorbers. J. Am. Chem. Soc. 139, 6718–6725 (2017).
- 33. Sun Q., Yin W.-J., Thermodynamic stability trend of cubic perovskites. J. Am. Chem. Soc. 139, 14905–14908 (2017).
- 34. Megaw H. D., Crystal structure of double oxides of the perovskite type. Proc. Phys. Soc. 58, 133 (1946).
- 35. Feteira A., Sinclair D. C., Rajab K. Z., Lanagan M. T., Crystal structure and microwave dielectric properties of alkaline-earth hafnates, AHfO3 (A=Ba, Sr, Ca). J. Am. Ceram. Soc. 91, 893–901 (2008).
- 36. Schmidt J., Shi J., Borlido P., Chen L., Botti S., Marques M. A. L., Predicting the thermodynamic stability of solids combining density functional theory and machine learning. Chem. Mater. 29, 5090–5103 (2017).
- 37. Faber F. A., Lindmaa A., von Lilienfeld O. A., Armiento R., Machine learning energies of 2 million elpasolite (ABC2D6) crystals. Phys. Rev. Lett. 117, 135502 (2016).
- 38. Xie T., Grossman J. C., Crystal graph convolutional neural networks for an accurate and interpretable prediction of material properties. Phys. Rev. Lett. 120, 145301 (2018).
- 39. Kirklin S., Saal J. E., Meredig B., Thompson A., Doak J. W., Aykol M., Rühl S., Wolverton C., The open quantum materials database (OQMD): Assessing the accuracy of DFT formation energies. npj Comput. Mater. 1, 15010 (2015).
- 40. Kamat P. V., Bisquert J., Buriak J., Lead-free perovskite solar cells. ACS Energy Lett. 2, 904–905 (2017).
- 41. Hellenbrandt M., The Inorganic Crystal Structure Database (ICSD)—Present and future. Crystallogr. Rev. 10, 17–22 (2004).
- 42. Li W., Ionescu E., Riedel R., Gurlo A., Can we predict the formability of perovskite oxynitrides from tolerance and octahedral factors? J. Mater. Chem. A 1, 12239–12245 (2013).
- 43. McClure E. T., Ball M. R., Windl W., Woodward P. M., Cs2AgBiX6 (X = Br, Cl): New visible light absorbing, lead-free halide perovskite semiconductors. Chem. Mater. 28, 1348–1354 (2016).
- 44. Choi S. O., Penninger M., Kim C. H., Schneider W. F., Thompson L. T., Experimental and computational investigation of effect of Sr on NO oxidation and oxygen exchange for La1–xSrxCoO3 perovskite catalysts. ACS Catal. 3, 2719–2728 (2013).
- 45. WebElements, www.webelements.com/.