Chemical shifts in molecular solids by machine learning

Federico M Paruzzo; Albert Hofstetter; Félix Musil; Sandip De; Michele Ceriotti; Lyndon Emsley

doi:10.1038/s41467-018-06972-x

2018 Oct 29;9:4501. doi: 10.1038/s41467-018-06972-x


PMCID: PMC6206069 PMID: 30374021

Abstract

Due to their strong dependence on local atomic environments, NMR chemical shifts are among the most powerful tools for structure elucidation of powdered solids or amorphous materials. Unfortunately, using them for structure determination depends on the ability to calculate them, which comes at the cost of high-accuracy first-principles calculations. Machine learning has recently emerged as a way to overcome the need for quantum chemical calculations, but for chemical shifts in solids it is hindered by the chemical and combinatorial space spanned by molecular solids, the strong dependency of chemical shifts on their environment, and the lack of an experimental database of shifts. We propose a machine learning method based on local environments to accurately predict chemical shifts of molecular solids and their polymorphs to within DFT accuracy. We also demonstrate that the trained model is able to determine, based on the match between experimentally measured and ML-predicted shifts, the structures of cocaine and the drug 4-[4-(2-adamantylcarbamoyl)-5-tert-butylpyrazol-1-yl]benzoic acid.

Solid-state nuclear magnetic resonance combined with quantum chemical shift predictions is limited by high computational cost. Here, the authors use machine learning based on local atomic environments to predict experimental chemical shifts in molecular solids with accuracy similar to density functional theory.

Introduction

Solid-state nuclear magnetic resonance (NMR) spectroscopy is among the most powerful methods for determining the atomic-level structure and dynamics of powdered and amorphous solids. Notably, solid-state NMR directly probes the local atomic environments and thus allows for characterization without the need for long-range order. This has led to its broad use today in many fields including for instance materials and pharmaceutical chemistry. In the latter the determination of structure and packing is essential to elaborate structure–property relations for formulations in the drug development process.

A revolution in solid-state NMR has occurred with the introduction of accurate methods to calculate chemical shifts^1–3, in particular using plane wave density functional theory (DFT) methods developed for periodic systems based on the projector augmented wave (PAW)/gauge-including PAW (GIPAW) approach^4–6. This has enabled very rapid development of chemical shift-based NMR crystallography, which is now widely used to validate structures of molecular solids and identify known polymorphs^7–26, or more recently in combination with crystal structure prediction (CSP) protocols, to determine de novo crystal structures from powders^27–32. Recent studies also suggest that the structural accuracy of chemical shift-based solid-state NMR crystallography is at least comparable with more traditional methods, such as single crystal X-ray diffraction³³.

The power of the method arises from the fact that plane wave DFT with the GIPAW method is accurate enough to reproduce the exquisite sensitivity of chemical shifts to changes in local atomic environments. However, this approach also has severe limitations. The cubic scaling of the computational cost with system size prevents the application to larger and more complex crystals, or nonequilibrium structures. If one wanted to use more accurate ab initio calculations, the expense is prohibitive.

Machine learning (ML) is emerging as a new tool in many areas of chemical and physical science, and potentially provides a method to bridge the gap between the need for high accuracy calculations and limited computational power^34–38. Notably, methods that predict chemical shifts for the specific case of proteins in solution from large experimental databases, using traditional^39–46 or machine learning approaches^47–49, have been considerably successful in predicting shifts based on local sequence and structural motifs, and are widely used today. While there are some examples of machine learned experimental and ab-initio chemical shifts of liquid and gas phase molecules^50–54, to date there is only one example of machine learning being applied to calculations of chemical shifts in solids, which deals with the specific case of silicas⁵⁵. Molecular solids are characterized by the combinatorial complexity and diversity of organic chemistry, the subtle dependence on conformations, and the long- and short-range effects of crystal packing, which lead to a considerably broader range of chemical environments and possible chemical shieldings than found, e.g., in proteins. All of these aspects, compounded by the fact that there is no extensive database of experimental chemical shifts for molecular solids, make this class of systems particularly challenging for machine learning.

Here, we develop a machine learning framework to predict chemical shifts in solids which is based on capturing the local environments of individual atoms, and which is therefore well suited for the prediction of such local properties. The protocol is schematically illustrated in Fig. 1. In the absence of a database of experimental shifts, and given that experiments alone do not provide a 1:1 mapping between chemical shifts and a single atomic configuration, we train the model on DFT calculated chemical shifts for structures taken from the Cambridge Structural Database (CSD)⁵⁶ chosen to be as diverse as possible, and then show that the method can predict chemical shifts in a test set with R² coefficients between the chemical shifts calculated with DFT and with ML of 0.97 for ¹H, 0.99 for ¹³C, 0.99 for ¹⁵N, and 0.99 for ¹⁷O, corresponding to root-mean-square-errors (RMSEs) of 0.49 ppm for ¹H, 4.3 ppm for ¹³C, 13.3 ppm for ¹⁵N, and 17.7 ppm for ¹⁷O. Predicting the chemical shifts for a polymorph of cocaine, with 86 atoms in the unit cell, using the ML method takes less than a minute of central processing unit (CPU) time, thus reducing the computational time by a factor of between 5 and 10 thousand, without any significant loss in accuracy as compared to DFT.

Fig. 1 — Scheme of the machine learning model used for the chemical shift predictions

Most significantly, even though no experimental shifts were used in training, we show that the model has sufficient accuracy to be used in a chemical shift-driven NMR crystallography protocol to determine, based on the match between experimentally measured and ML-predicted shifts, the correct structures of cocaine and of the drug 4-[4-(2-adamantylcarbamoyl)-5-tert-butylpyrazol-1-yl]benzoic acid (AZD8329). We also show that it is possible to calculate the NMR spectra of very large molecular crystals. For this we calculate the chemical shifts of six structures from the CSD with between 768 and 1584 atoms in the unit cells.

Results

Training and validation using DFT calculated shifts of known crystal structures

Machine learning models should by definition be trained on the property that is to be predicted. Here, that corresponds to experimental chemical shifts. However, for molecular solids there are currently only around 100 compounds with reliable crystal structures and for which assigned ¹H or ¹³C shifts have been published, despite the rapidly increasing activity of NMR in crystal structure determination. This is at least an order of magnitude too few structures to hope to determine a reliable prediction model. In this light, we note that today GIPAW chemical shift calculations can accurately reproduce experimental shifts^13,57. Thus we propose to develop a machine learning model to predict chemical shifts by training the model on a database made up of GIPAW calculated shifts from a large and diverse set of reference crystal structures. If the model can then accurately predict GIPAW chemical shifts, we hypothesize that it should also be in good agreement with experimental shifts. We also note in this context that even if there was a database of experimental shifts, there would be a challenge to machine learning related to the fact that the experiment reports on structures that include dynamics or distributions, making the connection between shifts and environments ambiguous. Learning using GIPAW calculated shifts does not suffer from this problem.

The approach we take to predicting chemical shifts in molecular solids is illustrated in Fig. 1. We use the Gaussian process regression (GPR) framework⁵⁸ to predict the chemical shift of a new atomic configuration based on a statistical model that identifies the correlations between structure and shift for a reference set of training configurations, for which the chemical shifts have been determined by a GIPAW DFT calculation. The predicted chemical shielding for a given atom is given by

$$\sigma(X) = \sum_{i} \alpha_{i}\, k(X, X_{i}), \qquad (1)$$

where X and $X_{i}$ correspond, respectively, to a description of the chemical environment of the atom for which we are making a prediction, and that of one of the training configurations. The weights $\alpha_{i}$ are obtained by requiring that Eq. (1) is consistent with the values computed by DFT for the reference structures. The essential ingredient that differentiates one GPR-based framework from another is the kernel function $k(X, X_{i})$, which describes and assesses the similarity between atomic environments, and provides basis functions to approximate the target properties.
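
A minimal sketch of how Eq. (1) could be evaluated in practice is given below. It assumes the kernel values are already available as matrices, and it obtains the weights by solving a regularized linear system; the regularization strength `reg`, the function names, and the toy numbers are illustrative assumptions and are not taken from the ShiftML implementation.

```python
import numpy as np

def fit_weights(K_train, sigma_train, reg=1e-10):
    """Solve (K + reg * I) alpha = sigma for the weights alpha_i.

    K_train[i, j] = k(X_i, X_j) for the training environments;
    sigma_train[i] is the reference (e.g. DFT) shielding of environment i.
    `reg` is a small regularization term (an assumption of this sketch).
    """
    n = K_train.shape[0]
    return np.linalg.solve(K_train + reg * np.eye(n), sigma_train)

def predict_shielding(k_new, alpha):
    """Evaluate sigma(X) = sum_i alpha_i k(X, X_i) for each row of k_new."""
    return k_new @ alpha

# Toy kernel values and shieldings (hypothetical numbers, in ppm).
K_train = np.array([[1.0, 0.5, 0.1],
                    [0.5, 1.0, 0.3],
                    [0.1, 0.3, 1.0]])
sigma_train = np.array([30.1, 27.4, 24.9])

alpha = fit_weights(K_train, sigma_train)
k_new = np.array([[0.8, 0.6, 0.2]])  # similarities of one new environment
sigma_new = predict_shielding(k_new, alpha)
```

With a negligible regularization the model reproduces the training shieldings, which is the consistency requirement stated above; a larger `reg` trades exact reproduction for robustness to noisy reference values.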

Here, our model relies on the smooth overlap of atomic positions (SOAP) kernel^59,60, in which any atomic environment is represented as a three-dimensional neighborhood density given by a superposition of Gaussians, one centered at each of the atom positions in a spherical neighborhood within a cut-off radius r_c from the core atom. This framework, combined with GPR, has been used to model the stability and properties of a number of different systems^35,59,60, and has been extended to the prediction of tensorial properties⁶¹. We can see that this choice of kernel should be particularly well adapted to predicting chemical shifts, since it describes the local environments around each atom without any simplification, and this is indeed what the chemical shift also probes, as it is determined by the screening of the nucleus from the main magnetic field by the electron density at the nucleus. Note that it should be possible to tune and train other ML methods to accurately predict chemical shifts of molecular crystals. While these possibilities will be explored in future work, the model we present here is already accurate enough to substitute for DFT calculations in chemical shift-based NMR crystallography.

As shown in Fig. 1, in the absence of an experimental database of shifts the model is developed by using a reference training set of structures for which chemical shifts are calculated with GIPAW DFT. To obtain a model which is robust and general, the training set should be as large, as reliable, and as diverse as possible. We first extract from the CSD a large set of about 61,000 structures, corresponding to all the structures in the CSD with fewer than 200 atoms, in order to make DFT chemical shift calculation affordable, and containing C and H and allowing for N and/or O, to reduce the space to organic molecular crystals (we call this set CSD-61k, see Supplementary Methods for details on the structure selection). Given that performing a GIPAW calculation for all of these structures would be prohibitively demanding, we then select a random subset of 500 structures (CSD-500, see Supplementary Note 1 and Supplementary Dataset 2) that are representative of the chemical diversity in the CSD, and we use it to test the accuracy of our model. For cross-validation and training, instead, we select 2000 structures (corresponding to about 185,000 atomic environments) out of the CSD-61k using a farthest point sampling algorithm^62,63 (CSD-2k, see Supplementary Note 2 and Supplementary Dataset 1). This step ensures near-uniform sampling of the conformational space, improving the quality of the model when using a relatively small number of reference calculations.
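
The farthest point sampling step can be illustrated with the following sketch. It assumes that each structure is summarized by a fixed-length feature vector and that dissimilarity is measured by Euclidean distance; both are simplifying assumptions for illustration, and the descriptor and distance actually used for CSD-2k are described in the Supplementary Methods.

```python
import numpy as np

def farthest_point_sampling(features, n_select, start=0):
    """Greedy farthest point sampling.

    Starting from index `start`, repeatedly pick the point whose distance
    to its nearest already-selected point is largest.
    """
    features = np.asarray(features, dtype=float)
    selected = [start]
    # Distance from every point to the closest selected point so far.
    min_dist = np.linalg.norm(features - features[start], axis=1)
    for _ in range(n_select - 1):
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        dist_to_new = np.linalg.norm(features - features[nxt], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)
    return selected

# Five hypothetical 2D descriptors; points 0 and 1 are near-duplicates.
points = [[0.0, 0.0], [0.1, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
chosen = farthest_point_sampling(points, n_select=3)
```

In this toy case the near-duplicate point 1 is never selected within the first three picks, which shows how the procedure favors diversity over density.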

To avoid including spurious environments in the model, e.g., environments which might not be well described by DFT, we also automatically detect and discard from the training set atomic environments with values of the DFT calculated shifts that are anomalous based on a cross-validation procedure described in the Supplementary Methods. Note that using this unbiased statistical analysis we detected only a small fraction of environments as outliers (e.g., 211 out of 76,214 for ¹H, or 0.3%). This is discussed in detail in the Supplementary Methods. We observe that the performance of the model degrades noticeably if one does not use this procedure. This pruning as well as the parameter optimization procedure, described below, were done exclusively using cross-validation on the CSD-2k set. (Notably the test sets were not subject to any curation.)

In order to reduce the computational cost of the training and testing procedures we then finally remove from the training set all the symmetrically equivalent environments. In the case of ¹H, this reduced the size of the training set from 70,000 to about 35,000 different atomic environments. (Details of the selection method and the members of the different sets used are given in the Supplementary Methods and Supplementary Note 3.)

All the atomic positions of the structures in the training and testing sets were relaxed with DFT, using the Quantum Espresso suite^64–66, prior to calculation of the chemical shieldings using the GIPAW DFT method^4,5. Note that the DFT relaxation ensures “reasonable” geometries will be used even for crystal structures containing errors (e.g., improbable ¹H positions). Parameters for the DFT calculations are given in the Supplementary Methods. The calculated chemical shieldings σ are converted to the corresponding chemical shifts δ through the relationship δ = σ_ref − σ. Here, we used a σ_ref of 30.8 ppm (for ¹H) and 169.5 ppm (for ¹³C), found through linear regression between the calculated and experimental chemical shifts for cocaine.
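
The conversion from shieldings to shifts can be sketched as follows, using the two reference values quoted above. The helper `fit_sigma_ref` is an illustrative assumption: it fits σ_ref by least squares with the slope fixed at −1, which may differ from the regression actually performed for cocaine.

```python
import numpy as np

# Reference shieldings in ppm, as quoted in the text.
SIGMA_REF_PPM = {"1H": 30.8, "13C": 169.5}

def shielding_to_shift(sigma, nucleus):
    """Convert shieldings to shifts via delta = sigma_ref - sigma."""
    return SIGMA_REF_PPM[nucleus] - np.asarray(sigma, dtype=float)

def fit_sigma_ref(sigma_calc, delta_exp):
    """Least-squares sigma_ref for delta = sigma_ref - sigma (slope fixed at -1).

    Assumption of this sketch: only the offset is fitted.
    """
    sigma_calc = np.asarray(sigma_calc, dtype=float)
    delta_exp = np.asarray(delta_exp, dtype=float)
    return float(np.mean(delta_exp + sigma_calc))

# Hypothetical 1H shieldings (ppm) converted to shifts.
shifts = shielding_to_shift([25.8, 28.8], "1H")
```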

Figure 2 shows the chemical shift error between the DFT calculations and the ML predictions for the CSD-500 set, which is representative of the expected accuracy for the entire CSD-61k. The figure shows the overall prediction accuracy for ¹H chemical shifts as RMSE in ppm between the shifts calculated with DFT and with the protocol described above, which we refer to in the following as ShiftML, as a function of the cut-off radius (r_c) and as a function of the number of training structures included from CSD-2k. The effect of the different cut-off radii is clearly visible. For example, for r_c = 2 Å the prediction error for a small training set (<10 structures or <100 atomic environments) can be smaller than for the larger radii, but does not improve significantly with increasing size of the training set. On the contrary, for r_c = 7 Å we observe a relatively large prediction error for a small training set, but even with 2000 structures (35,000 environments), the prediction error is still decreasing. A similar behavior is observed for the prediction errors of the ¹³C, ¹⁵N, and ¹⁷O chemical shifts (see Supplementary Figures 5–8).

Fig. 2 — ¹H chemical shift prediction error of the trained model for the CSD-500 set. The RMSE prediction error between chemical shifts calculated with ShiftML and GIPAW DFT is shown for different local environment cut-off radii, and for the multi-kernel (labeled as msk), as a function of the training set size

The observed differences in the behavior of the prediction error with respect to r_c clearly indicate the influence of the different extents of the local environment on the chemical shift. Short-range interactions are sufficient to explain the rough order of magnitude of the shift, but long-range interactions are required to learn about the higher order influences of next-nearest neighbors on shifts. However, for long-range interactions, a much larger number of environments is needed in order to determine the correlation between environment and shift.

We exploit these differences to generate a combined SOAP kernel consisting of a linear combination of the single local environment kernels³⁵, with weightings of 256 (r_c = 2 Å), 128 (r_c = 3 Å), 32 (r_c = 4 Å), 8 (r_c = 5 Å and r_c = 6 Å), and 1 (r_c = 7 Å). This weighting was determined by rough optimization around values inspired by previous experience³⁵, and by cross-validation on the CSD-2k training set (as described in the Supplementary Methods). It is clear that learning with the combined kernel leads consistently to lower prediction errors than any of the single kernels, although the improvement in performance varies between nuclei (see Supplementary Figures 5–8).
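
The linear combination of single-cutoff kernels can be written compactly as in the sketch below. The weights are those quoted above; dividing by the sum of the weights is an optional convention assumed here for readability and is not stated in the text.

```python
import numpy as np

# Weights per cut-off radius r_c (in angstrom), as quoted in the text.
KERNEL_WEIGHTS = {2.0: 256, 3.0: 128, 4.0: 32, 5.0: 8, 6.0: 8, 7.0: 1}

def combine_kernels(kernels_by_cutoff, weights=KERNEL_WEIGHTS, normalize=True):
    """Weighted sum of kernel matrices computed at different cut-off radii.

    kernels_by_cutoff maps r_c -> kernel matrix; all matrices must share
    the same shape. Normalizing by the total weight is an assumption.
    """
    combined = sum(weights[rc] * np.asarray(K, dtype=float)
                   for rc, K in kernels_by_cutoff.items())
    if normalize:
        combined = combined / sum(weights[rc] for rc in kernels_by_cutoff)
    return combined

# Toy example: identical 2x2 kernels at every cut-off (hypothetical values).
toy = {rc: np.eye(2) for rc in KERNEL_WEIGHTS}
K_combined = combine_kernels(toy)
```

The steeply decreasing weights mean that the short-range kernels dominate the similarity measure, while the long-range kernels contribute smaller corrections.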

Figure 3a–d shows correlation plots between ¹H, ¹³C, ¹⁵N, and ¹⁷O chemical shifts calculated by DFT and by ShiftML for the CSD-500 set trained on the whole CSD-2k combined kernel. Using the combined kernel, we reach an error between ShiftML and DFT calculated chemical shifts of 0.49 ppm for ¹H (4.3 ppm for ¹³C, 13.3 ppm for ¹⁵N, and 17.7 ppm for ¹⁷O). This is very comparable with reported DFT chemical shift accuracy for ¹H of 0.33–0.43 ppm^13,57, while requiring a fraction of the computational time and cost: less than 1 CPU minute compared to ~62–150 CPU hours for DFT chemical shift calculation on structures containing 86 atoms (around 350 valence electrons) (see Supplementary Figure 4). For the other nuclei, the ML accuracy is slightly lower than reported values (1.9–2.2 ppm for ¹³C, 5.4 ppm for ¹⁵N, and 7.2 ppm for ¹⁷O)^13,57, which is not surprising as there are (currently) significantly fewer training environments for the heteronuclei than for ¹H.
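
As a rough check of the quoted acceleration for the 86-atom structure, taking 1 CPU minute as an upper bound on the ShiftML cost and the DFT range given above:

$$\frac{62\ \text{h} \times 60\ \text{min/h}}{1\ \text{min}} = 3720, \qquad \frac{150\ \text{h} \times 60\ \text{min/h}}{1\ \text{min}} = 9000.$$

Because the ML prediction takes less than 1 CPU minute, these ratios are lower bounds on the speed-up, which is therefore of the order of several thousand for this system.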

Fig. 3 — Comparison of predictions from ShiftML and GIPAW DFT. Histograms and scatterplots showing the correlation between ¹H (a), ¹³C (b), ¹⁵N (c), and ¹⁷O (d) chemical shifts (shieldings) calculated with GIPAW and ShiftML. The black lines indicate a perfect correlation

The R² coefficients between the chemical shifts calculated with DFT and with ShiftML are 0.97 for ¹H, 0.99 for ¹³C, 0.99 for ¹⁵N, and 0.99 for ¹⁷O.
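
For reference, the two figures of merit used throughout can be computed as in the sketch below. It assumes that R² denotes the squared Pearson correlation coefficient between the two sets of shifts; the text does not state which definition was used, so this is an assumption.

```python
import numpy as np

def rmse(pred, ref):
    """Root-mean-square error between predicted and reference shifts."""
    pred = np.asarray(pred, dtype=float)
    ref = np.asarray(ref, dtype=float)
    return float(np.sqrt(np.mean((pred - ref) ** 2)))

def r_squared(pred, ref):
    """Squared Pearson correlation (assumed definition of R^2)."""
    r = np.corrcoef(np.asarray(pred, dtype=float),
                    np.asarray(ref, dtype=float))[0, 1]
    return float(r ** 2)

# Hypothetical ML-predicted and DFT shifts (ppm).
ml = [1.0, 2.0, 3.0]
dft = [1.0, 2.0, 5.0]
error = rmse(ml, dft)
```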

Note that the CSD-500 set used for testing is selected randomly from CSD-61k and not curated. Indeed, we find that many of the atomic environments in the CSD-500 set with a relatively high prediction RMSE possess either unusual cavities inside their crystal structure, possibly indicating an organic cage surrounding noncrystalline solvent or other atoms, or exhibit strongly delocalized π-bonding networks. While there is no theoretical reason preventing the machine learning model from correctly describing such environments, they are rare and not well represented within the training set. CSD-500 thus constitutes a fairly demanding test set.

Predicting shifts for polymorphs

Having evaluated the power of the trained model to predict the diverse CSD-500 set, we now look at the capacity to predict potentially subtler differences by looking at a set of polymorphs of a given structure. Figure 4 shows the correlation between the ¹H shifts calculated by GIPAW DFT and by ShiftML for 30 polymorphs of cocaine and 14 polymorphs of AZD8329, all of which were previously generated with a CSP procedure^16,27. The figure clearly shows that ShiftML is able to accurately predict the differences in ¹H chemical shift for different polymorphs.

Fig. 4 — Comparison of predictions from ShiftML and GIPAW DFT for polymorphs of cocaine and AZD8329. a Histogram showing the distribution of the differences between ¹H chemical shifts calculated with GIPAW and with ShiftML for the polymorphs of cocaine (blue), and the polymorphs of AZD8329 (orange). b Scatterplot showing the correlation between ¹H chemical shifts calculated with GIPAW and ShiftML for cocaine (blue) and AZD8329 (orange). The black line indicates a perfect correlation

We find a chemical shift prediction error (RMSE) between GIPAW DFT and ShiftML for ¹H for the cocaine polymorphs of 0.37 ppm and for AZD8329 of 0.46 ppm. Note that these values are slightly lower than for the CSD-500 set, which might be expected for these two fairly typical organic structures, and which suggests that the randomly selected CSD-500 indeed provides a good overall benchmark.

Note that for these cases the DFT structure optimization and GIPAW chemical shift calculation were done with a different DFT program (CASTEP)⁶⁷, which suggests that ShiftML is robust with respect to small deviations from the fully optimized structures. (As shown in the Supplementary Figure 2, performing the prediction using Quantum Espresso consistently leads to a comparable prediction accuracy.)

For the heteronuclei we obtain an RMSE between GIPAW DFT and ShiftML for cocaine of 3.8 ppm for ¹³C, 12.1 ppm for ¹⁵N, and 15.7 ppm for ¹⁷O. For AZD8329 the ¹⁵N and ¹⁷O RMSEs are proportionally larger (17.7 and 54.7 ppm), and we attribute this to the fact that the molecule contains a rather unusual C–O…H–N /C–O…H–O H-bonded dimer structure, for which the learning is thus even sparser than for the heteronuclei in general. To illustrate the unusual nature of this motif, we note that the calculated ¹⁷O shifts using DFT also change by up to 50 ppm for structures relaxed either by the CASTEP protocol used in ref. ³⁰, or the Quantum Espresso protocol used here (the RMSE between ML and DFT for the Quantum Espresso relaxed structures is reduced to 10.9 and 11.5 ppm for ¹⁵N and ¹⁷O, respectively). The RMSE of 4.0 ppm for ¹³C for AZD8329 is in line with the other systems.

Predicting experimental shifts and structure determination

Further, the significance of the method is illustrated by comparison to experimentally measured shifts. This comparison is particularly important since the training protocol did not involve any experimentally measured chemical shifts. We find that the predicted shifts are accurate enough to allow crystal structure determination for both cocaine and AZD8329 from powder samples in a chemical shift-driven NMR crystallography approach.

Figure 5a, b shows the correlation between experimentally measured ¹H chemical shifts and the ¹H chemical shifts calculated by ShiftML for crystal structures of the six molecules shown in Fig. 6 (numerical values of the experimental chemical shifts, the crystal structures, and the shifts calculated with ShiftML are given in the Supplementary Methods and Supplementary Dataset 8). The comparison between experimental and calculated ¹H chemical shifts for all crystal structures (for a total of 68 shifts) gives an error (RMSE) of 0.39 ppm and an R² coefficient of 0.99. This compares very favorably to the equivalent agreement found between GIPAW DFT and experiment, which for this set of structures is an RMSE of 0.38 ppm.

Fig. 5 — Comparison of ShiftML to experimentally measured shifts. a Histogram showing the distribution of differences between experimentally measured ¹H chemical shifts and ¹H chemical shifts calculated with ShiftML for six different crystal structures (see Supplementary Methods for the structures and numerical values of the shifts). b Scatterplot showing the correlation between these experimentally measured ¹H chemical shifts and shifts calculated with ShiftML. c, d Comparison between calculated and experimental ¹H chemical shifts for the most stable structures obtained with CSP for cocaine (c) and AZD8329 (d). For each candidate structure an aggregate RMSE is shown between experimentally measured shifts and shifts calculated using either GIPAW (blue) or ShiftML (red). The gray zones represent the confidence intervals of the GIPAW DFT ¹H chemical shift RMSD, as described in the text¹³, and candidates (in c and d) that have RMSEs within this range would be determined as correct crystal structures using a chemical shift-driven solid-state NMR crystallography protocol

Fig. 6 — Chemical structures of the six molecules used to evaluate the correlation between experimentally measured ¹H chemical shifts and the shifts calculated by ShiftML. The structures are given as AZD8329 (a), theophylline (b), cocaine (c), uracil (d), 3,5-dimethylimidazole and 4,5-dimethylimidazole (e) and naproxen (f)

Figure 5c, d shows in blue the RMSE between DFT calculated and experimental ¹H chemical shifts for the 30 polymorphs predicted by CSP to have the lowest energy for cocaine and the 14 cis polymorphs of AZD8329. For both molecules the only structure in agreement with the GIPAW DFT calculations, to below a ¹H DFT chemical shift confidence interval of 0.49 ppm¹³, is the correct crystal structure. In the same plots we overlay the result where the experimental shifts are now compared to shifts predicted with ShiftML. Note that the RMSE between experiment and the predicted chemical shifts follows the same trends as for the DFT calculated shifts, and that here again the only structures below the confidence interval of 0.49 ppm are the two correct crystal structures. Note that the cut-off of 0.49 ppm with respect to experiment has been evaluated for GIPAW DFT chemical shifts^13,57, and that to rigorously repeat the CSP procedure for the ML method, the accuracy should be re-evaluated using more extensive benchmarking of ShiftML to experiment, which will be the subject of further work.
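
The selection step described above amounts to ranking candidate structures by their RMSE to experiment and keeping those below the confidence threshold. A minimal sketch follows, assuming that predicted and experimental shifts are already assigned and listed in the same atom order; the candidate names and shift values are hypothetical.

```python
import numpy as np

CONFIDENCE_PPM = 0.49  # 1H threshold quoted in the text for GIPAW DFT

def rank_candidates(predicted_by_candidate, experimental,
                    threshold=CONFIDENCE_PPM):
    """Rank candidates by RMSE to experiment; return (ranking, accepted)."""
    exp = np.asarray(experimental, dtype=float)
    scores = {
        name: float(np.sqrt(np.mean((np.asarray(p, dtype=float) - exp) ** 2)))
        for name, p in predicted_by_candidate.items()
    }
    ranked = sorted(scores.items(), key=lambda item: item[1])
    accepted = [name for name, score in ranked if score < threshold]
    return ranked, accepted

# Hypothetical 1H shifts (ppm) for three candidate structures.
experimental = [1.1, 2.0, 7.0]
candidates = {
    "candidate_A": [1.0, 2.0, 7.0],
    "candidate_B": [1.8, 2.9, 6.2],
    "candidate_C": [2.1, 3.0, 8.0],
}
ranked, accepted = rank_candidates(candidates, experimental)
```

In this toy case only one candidate falls below the threshold, mirroring the situation reported for cocaine and AZD8329.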

Predicting shifts for large structures

Finally, we note that the accuracy of the method does not depend on the size of the structure, and that the prediction time is linear in the number of atoms. For the structures we calculate here the prediction time actually appears nearly constant, because it is dominated by the loading time of the reference SOAP vector (see Fig. 7a). We have used this method to calculate the NMR spectra (shown in Fig. 7b–g) for six structures from the CSD having among the largest numbers of atoms per unit cell (containing only H, C, N, and O), with between 768 and 1584 atoms per unit cell. (See Supplementary Figure 10 for the chemical formula). The values of the predicted chemical shifts are given as CSD-6 in the Supplementary Note 4. Figure 7a shows the comparison between the GIPAW calculation time and the required ML prediction time. We estimate that the whole calculation would require around 16 CPU years by GIPAW. ShiftML requires less than 6 CPU minutes to calculate the shifts for all the compounds.

Fig. 7 — Chemical shift calculation times and large structures. a DFT GIPAW calculation time (blue) and ShiftML prediction time (turquoise) for different system sizes. The GIPAW DFT calculation time for the six large structures (orange) is estimated from a cubic dependence on the number of valence electrons in the structure (see Supplementary Methods). b–g 3D schemes and ¹H NMR spectra predicted with ShiftML, of the six large molecular crystals with CSD refcodes: b CAJVUH⁶⁹, N_atoms = 828, c RUKTOI⁷⁰, N_atoms = 768, d EMEMUE⁷¹, N_atoms = 860, e GOKXOV⁷², N_atoms = 945, f HEJBUW⁷³, N_atoms = 816, and g RAYFEF⁷⁴, N_atoms = 1584
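
The cubic extrapolation of the GIPAW cost can be illustrated with a one-point scaling law. The sketch below anchors the estimate to the reference case quoted earlier (about 350 valence electrons, 62 CPU hours at the lower end of the reported range); the actual fit is described in the Supplementary Methods and may differ, so the function and its defaults are assumptions.

```python
def estimate_gipaw_cpu_hours(n_valence_electrons,
                             ref_electrons=350,
                             ref_cpu_hours=62.0):
    """Extrapolate DFT cost assuming cubic scaling with valence electrons.

    The reference point (350 electrons, 62 CPU hours) is taken from the
    86-atom example in the text; anchoring to a single point is an
    assumption of this sketch.
    """
    return ref_cpu_hours * (n_valence_electrons / ref_electrons) ** 3

# Doubling the number of valence electrons multiplies the cost by 8.
doubled = estimate_gipaw_cpu_hours(700)
```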

Discussion

We have presented an ML model based on local environments to predict chemical shifts of molecular solids containing HCNO to within current DFT accuracy. The R² coefficients between the chemical shifts calculated with DFT and with ShiftML are 0.97 for ¹H, 0.99 for ¹³C, 0.99 for ¹⁵N, and 0.99 for ¹⁷O. The approach allows the calculation of chemical shifts for structures with ~100 atoms in less than 1 min, reducing the computational cost of chemical shift predictions in solids by a factor of between five and ten thousand compared to current DFT chemical shift calculations, and thereby relieves a major bottleneck in the use of calculated chemical shifts for structure determination in solids.

Far from being just a benchmark of a machine-learning scheme, the method is accurate enough to be used to determine structures by comparison to experimental shifts in chemical shift-based NMR crystallography approaches to structure determination, as shown here for cocaine and AZD8329. The ML model only scales linearly with the number of atoms and, for the prediction of individual structures, is dominated by a constant I/O overhead. Here it allows the calculation of chemical shifts for a set of six structures with between 768 and 1584 atoms in their unit cells in less than 6 min (an acceleration of a factor 10⁶ for the largest structure).

The accuracy of the method is likely to increase further with the size of the training set, and subsequently with the future evolution of the accuracy of the method used to calculate the reference shifts used in training (here DFT), or by using experimental shifts if a large enough set were available. A web version based on the protocol described here is publicly available at http://shiftml.epfl.ch.

The model used here can easily be extended to organic solids including halides or other nuclei, and to network materials such as oxides, and these will be the subject of further work.

Methods

Computational details

For the SOAP kernels^59,60, each atomic environment is represented as a three-dimensional neighborhood density given by a superposition of Gaussians, one centered at each of the atom positions in a spherical neighborhood within a cut-off radius r_c from the core atom. The Gaussians have a variance $\varsigma^{2}$, and a separate density is built for each atomic species. The kernel is then constructed as the symmetrized overlap between the amplitudes representing X and X′. This degree of overlap thus measures the similarity between the environments X and X′.
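
The neighborhood density described above can be illustrated for a single atomic species with the sketch below. It evaluates an unnormalized sum of Gaussians at one point in space and ignores periodic images, smooth cut-off functions, and the symmetrized overlap that defines the actual SOAP kernel; all of these simplifications, and the function name, are assumptions made for illustration only.

```python
import numpy as np

def neighbor_density(point, center, neighbors, r_cut, width):
    """Gaussian-smeared neighbor density of one species, evaluated at `point`.

    Only neighbors within `r_cut` of `center` contribute; each contributes
    exp(-|point - r_j|^2 / (2 * width^2)), where `width` is the Gaussian
    standard deviation. The Gaussians are left unnormalized.
    """
    point = np.asarray(point, dtype=float)
    center = np.asarray(center, dtype=float)
    neighbors = np.asarray(neighbors, dtype=float)
    inside = np.linalg.norm(neighbors - center, axis=1) <= r_cut
    d2 = np.sum((neighbors[inside] - point) ** 2, axis=1)
    return float(np.sum(np.exp(-d2 / (2.0 * width ** 2))))

# One neighbor inside the cut-off sphere and one outside (positions in angstrom).
neighbors = [[1.0, 0.0, 0.0], [10.0, 0.0, 0.0]]
value = neighbor_density([1.0, 0.0, 0.0], [0.0, 0.0, 0.0],
                         neighbors, r_cut=3.0, width=0.5)
```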

The SOAP and GPR parameters are given in the Supplementary Methods. SOAP-based structural kernels contain several adjustable hyper-parameters, which are discussed in ref. ⁶⁰. However, we have not systematically explored the full parametric space here; instead, we chose reasonable values of the parameters without extensive fine-tuning, based on previous experience³⁵ and with some optimization by cross-validation on the CSD-2k training set (see Supplementary Methods for details). We also combine kernels computed for different cutoff radii to capture the contributions to shifts from different length scales³⁵, as is described in detail above. The calculations of the local environment, the similarity kernel and the weighted correlations were done using the glosim2 package⁶⁸.

In summary, the Supplementary Information contains details on the structure selection, the crystal structure prediction procedure, the DFT calculations, the GPR method, the SOAP kernels, the FPS algorithm, the detection procedure of unusual environments, the NMR crystallography approach, the DFT calculation time estimates, the prediction parameters optimization, the learning curves and the evaluation curves for ¹H, ¹³C, ¹⁵N, and ¹⁷O. Additionally, we provide the ShiftML predicted and GIPAW chemical shieldings for all cocaine and AZD8329 polymorphs as well as the geometries, the assigned experimental chemical shifts and the chemical shifts calculated with GIPAW DFT and ShiftML for the comparison to experimentally measured shifts. The Supplementary Information also contains the chemical formula and predicted chemical shieldings of the CSD-6 set predicted with ShiftML, the Refcodes for CSD-2k and CSD-500 and the relaxed geometries and GIPAW DFT calculated chemical shifts of all investigated crystal structures.

Code availability

The machine learning code to calculate the SOAP environments, the kernels, and the chemical shifts is called glosim2, and is publicly available at https://github.com/cosmo-epfl/glosim2. The DFT codes used to optimize geometry and calculate chemical shifts are available from the corresponding developers.

Electronic supplementary material

Supplementary Information (2.7MB, pdf)

Description of Additional Supplementary Files: 41467_2018_6972_MOESM2_ESM.pdf (63.7KB, pdf)

Supplementary Data 1 (9.4MB, txt)

Supplementary Data 2 (3.6MB, txt)

Supplementary Data 3 (14.1KB, zip)

Supplementary Data 4 (62.7KB, zip)

Supplementary Data 5 (26.5KB, zip)

Supplementary Data 6 (122.3KB, zip)

Supplementary Data 7 (18KB, zip)

Supplementary Data 8 (19.6KB, zip)

Supplementary Data 9 (143.4KB, zip)

Acknowledgments

We are grateful for financial support from Swiss National Science Foundation Grant no. 200021_160112. F.M. and S.D. were supported by the NCCR MARVEL, funded by the Swiss National Science Foundation. M.C. acknowledges funding by the European Research Council under the European Union’s Horizon 2020 research and innovation program (Grant agreement no. 677013-HBMAP). This work was also supported by EPFL through the use of the facilities of its Scientific IT and Application Support Center.

Author contributions

F.M.P. conceived the project, performed experiments, analyzed the data, and wrote the paper. A.H. conceived the project, performed experiments, analyzed the data, and wrote the paper. F.M. performed experiments, analyzed the data, and wrote the paper. S.D. performed experiments and analyzed the data. M.C. conceived the project, supervised the experiments, analyzed the data, and wrote the paper. L.E. conceived the project, supervised the experiments, analyzed the data, and wrote the paper.

Data availability

The authors declare that the data supporting the findings of this study are available within the paper and its supplementary information. In particular, all crystallographic structures used are referenced in the Supplementary Information and are publicly available at the Cambridge Structural Database. The relaxed crystal structures with the chemical shieldings calculated by GIPAW DFT and ShiftML are included in the supplementary information files and in the Supplementary Datasets.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Michele Ceriotti, Email: michele.ceriotti@epfl.ch.

Lyndon Emsley, Email: lyndon.emsley@epfl.ch.

Electronic supplementary material

Supplementary Information accompanies this paper at 10.1038/s41467-018-06972-x.

References

1. Dedios AC, Pearson JG, Oldfield E. Secondary and tertiary structural effects on protein NMR chemical shifts: an ab initio approach. Science. 1993;260:1491–1496. doi: 10.1126/science.8502992.
2. Facelli JC, Grant DM. Determination of molecular symmetry in crystalline naphthalene using solid-state NMR. Nature. 1993;365:325–327. doi: 10.1038/365325a0.
3. Sebastiani D, Parrinello M. A new ab-initio approach for NMR chemical shifts in periodic systems. J. Phys. Chem. A. 2001;105:1951–1958. doi: 10.1021/jp002807j.
4. Pickard CJ, Mauri F. All-electron magnetic response with pseudopotentials: NMR chemical shifts. Phys. Rev. B. 2001;63:245101.
5. Yates JR, Pickard CJ, Mauri F. Calculation of NMR chemical shifts for extended systems using ultrasoft pseudopotentials. Phys. Rev. B. 2007;76:024401. doi: 10.1103/PhysRevB.76.024401.
6. Blochl PE. Projector augmented-wave method. Phys. Rev. B Condens. Matter Mater. Phys. 1994;50:17953–17979. doi: 10.1103/PhysRevB.50.17953.
7. Ochsenfeld C, Brown SP, Schnell I, Gauss J, Spiess HW. Structure assignment in the solid state by the coupling of quantum chemical calculations with NMR experiments: a columnar hexabenzocoronene derivative. J. Am. Chem. Soc. 2001;123:2597–2606. doi: 10.1021/ja0021823.
8. Harris RK. NMR crystallography: the use of chemical shifts. Solid State Sci. 2004;6:1025–1037. doi: 10.1016/j.solidstatesciences.2004.03.040.
9. Harper JK, Grant DM. Enhancing crystal-structure prediction with NMR tensor data. Cryst. Growth Des. 2006;6:2315–2321. doi: 10.1021/cg060244g.
10. Harris RK. Applications of solid-state NMR to pharmaceutical polymorphism and related matters. J. Pharm. Pharmacol. 2007;59:225–239. doi: 10.1211/jpp.59.2.0009.
11. Othman A, Evans JS, Evans IR, Harris RK, Hodgkinson P. Structural study of polymorphs and solvates of finasteride. J. Pharm. Sci. 2007;96:1380–1397. doi: 10.1002/jps.20940.
12. Salager E, Stein RS, Pickard CJ, Elena B, Emsley L. Powder NMR crystallography of thymol. Phys. Chem. Chem. Phys. 2009;11:2610–2621. doi: 10.1039/b821018g.
13. Salager E, et al. Powder crystallography by combined crystal structure prediction and high-resolution 1H solid-state NMR spectroscopy. J. Am. Chem. Soc. 2010;132:2564–2566. doi: 10.1021/ja909449k.
14. Webber AL, Emsley L, Claramunt RM, Brown SP. NMR crystallography of campho[2,3-c]pyrazole (Z’ = 6): combining high-resolution 1H-13C solid-state MAS NMR spectroscopy and GIPAW chemical-shift calculations. J. Phys. Chem. A. 2010;114:10435–10442. doi: 10.1021/jp104901j.
15. Dudenko D, et al. A strategy for revealing the packing in semicrystalline pi-conjugated polymers: crystal structure of bulk poly-3-hexyl-thiophene (P3HT). Angew. Chem. Int. Ed. Engl. 2012;51:11068–11072. doi: 10.1002/anie.201205075.
16. Baias M, et al. Powder crystallography of pharmaceutical materials by combined crystal structure prediction and solid-state 1H NMR spectroscopy. Phys. Chem. Chem. Phys. 2013;15:8069–8080. doi: 10.1039/c3cp41095a.
17. Pawlak T, Jaworska M, Potrzebowski MJ. NMR crystallography of alpha-poly(l-lactide). Phys. Chem. Chem. Phys. 2013;15:3137–3145. doi: 10.1039/c2cp43174b.
18. Santos SM, Rocha J, Mafra L. NMR crystallography: toward chemical shift-driven crystal structure determination of the beta-lactam antibiotic amoxicillin trihydrate. Cryst. Growth Des. 2013;13:2390–2395. doi: 10.1021/cg4002785.
19. Ludeker D, Brunklaus G. NMR crystallography of ezetimibe co-crystals. Solid. State Nucl. Magn. Reson. 2015;65:29–40. doi: 10.1016/j.ssnmr.2014.11.002.
20. Paluch P, Pawlak T, Oszajca M, Lasocha W, Potrzebowski MJ. Fine refinement of solid state structure of racemic form of phospho-tyrosine employing NMR crystallography approach. Solid. State Nucl. Magn. Reson. 2015;65:2–11. doi: 10.1016/j.ssnmr.2014.08.002.
21. Watts AE, Maruyoshi K, Hughes CE, Brown SP, Harris KDM. Combining the advantages of powder X-ray diffraction and NMR crystallography in structure determination of the pharmaceutical material cimetidine hydrochloride. Cryst. Growth Des. 2016;16:1798–1804. doi: 10.1021/acs.cgd.6b00016.
22. Widdifield CM, Robson H, Hodgkinson P. Furosemide’s one little hydrogen atom: NMR crystallography structure verification of powdered molecular organics. Chem. Commun. 2016;52:6685–6688. doi: 10.1039/C6CC02171A.
23. Mali G. Ab initio crystal structure prediction of magnesium (poly)sulfides and calculation of their NMR parameters. Acta Crystallogr. Sect. C Struct. Chem. 2017;73:229–233. doi: 10.1107/S2053229617000687.
24. Harris RK, Joyce SA, Pickard CJ, Cadars S, Emsley L. Assigning carbon-13 NMR spectra to crystal structures by the INADEQUATE pulse sequence and first principles computation: a case study of two forms of testosterone. Phys. Chem. Chem. Phys. 2006;8:137–143. doi: 10.1039/B513392K.
25. Mifsud N, Elena B, Pickard CJ, Lesage A, Emsley L. Assigning powders to crystal structures by high-resolution (1)H-(1)H double quantum and (1)H-(13)C J-INEPT solid-state NMR spectroscopy and first principles computation. A case study of penicillin G. Phys. Chem. Chem. Phys. 2006;8:3418–3422. doi: 10.1039/B605227D.
26. Heider EM, Harper JK, Grant DM. Structural characterization of an anhydrous polymorph of paclitaxel by solid-state NMR. Phys. Chem. Chem. Phys. 2007;9:6083–6097. doi: 10.1039/b711027h.
27. Baias M, et al. De novo determination of the crystal structure of a large drug molecule by crystal structure prediction-based powder NMR crystallography. J. Am. Chem. Soc. 2013;135:17501–17507. doi: 10.1021/ja4088874.
28. Fernandes JA, Sardo M, Mafra L, Choquesillo-Lazarte D, Masciocchi N. X-ray and NMR crystallography studies of novel theophylline cocrystals prepared by liquid assisted grinding. Cryst. Growth Des. 2015;15:3674–3683. doi: 10.1021/acs.cgd.5b00279.
29. Leclaire J, et al. Structure elucidation of a complex CO2-based organic framework material by NMR crystallography. Chem. Sci. 2016;7:4379–4390. doi: 10.1039/C5SC03810C.
30. Selent M, et al. Clathrate structure determination by combining crystal structure prediction with computational and experimental (129)Xe NMR spectroscopy. Chemistry. 2017;23:5258–5269. doi: 10.1002/chem.201604797.
31. Widdifield CM, et al. Does Z’ equal 1 or 2? Enhanced powder NMR crystallography verification of a disordered room temperature crystal structure of a p38 inhibitor for chronic obstructive pulmonary disease. Phys. Chem. Chem. Phys. 2017;19:16650–16661. doi: 10.1039/C7CP02349A.
32. Nilsson Lill SO, et al. Elucidating an amorphous form stabilization mechanism for tenapanor hydrochloride: crystal structure analysis using X-ray diffraction, NMR crystallography, and molecular modeling. Mol. Pharm. 2018;15:1476–1487. doi: 10.1021/acs.molpharmaceut.7b01047.
33. Hofstetter A, Emsley L. Positional variance in NMR crystallography. J. Am. Chem. Soc. 2017;139:2573–2576. doi: 10.1021/jacs.6b12705.
34. Curtarolo S, et al. The high-throughput highway to computational materials design. Nat. Mater. 2013;12:191–201. doi: 10.1038/nmat3568.
35. Bartok AP, et al. Machine learning unifies the modeling of materials and molecules. Sci. Adv. 2017;3:e1701816. doi: 10.1126/sciadv.1701816.
36. Xue D, et al. Accelerated search for materials with targeted properties by adaptive design. Nat. Commun. 2016;7:11241. doi: 10.1038/ncomms11241.
37. Ward L, Agrawal A, Choudhary A, Wolverton C. A general-purpose machine learning framework for predicting properties of inorganic materials. npj Comput. Mater. 2016;2:16028.
38. Rupp M, Tkatchenko A, Muller KR, von Lilienfeld OA. Fast and accurate modeling of molecular atomization energies with machine learning. Phys. Rev. Lett. 2012;108:058301. doi: 10.1103/PhysRevLett.108.058301.
39. Shen Y, Bax A. Protein backbone chemical shifts predicted from searching a database for torsion angle and sequence homology. J. Biomol. NMR. 2007;38:289–302. doi: 10.1007/s10858-007-9166-6.
40. Neal S, Nip AM, Zhang HY, Wishart DS. Rapid and accurate calculation of protein H-1, C-13 and N-15 chemical shifts. J. Biomol. NMR. 2003;26:215–240. doi: 10.1023/A:1023812930288.
41. Wishart DS, Watson MS, Boyko RF, Sykes BD. Automated H-1 and C-13 chemical shift prediction using the BioMagResBank. J. Biomol. NMR. 1997;10:329–336. doi: 10.1023/A:1018373822088.
42. Iwadate M, Asakura T, Williamson MP. C alpha and C beta carbon-13 chemical shifts in proteins from an empirical database. J. Biomol. NMR. 1999;13:199–211. doi: 10.1023/A:1008376710086.
43. Xu XP, Case DA. Automated prediction of (15)N, (13)C(alpha), (13)C(beta) and (13)C' chemical shifts in proteins using a density functional database. J. Biomol. NMR. 2001;21:321–333. doi: 10.1023/A:1013324104681.
44. Moon S, Case DA. A new model for chemical shifts of amide hydrogens in proteins. J. Biomol. NMR. 2007;38:139–150. doi: 10.1007/s10858-007-9156-8.
45. Vila JA, Arnautova YA, Martin OA, Scheraga HA. Quantum-mechanics-derived 13Calpha chemical shift server (CheShift) for protein structure validation. Proc. Natl Acad. Sci. USA. 2009;106:16972–16977. doi: 10.1073/pnas.0908833106.
46. Kohlhoff KJ, Robustelli P, Cavalli A, Salvatella X, Vendruscolo M. Fast and accurate predictions of protein NMR chemical shifts from interatomic distances. J. Am. Chem. Soc. 2009;131:13894–13895. doi: 10.1021/ja903772t.
47. Meiler J. PROSHIFT: protein chemical shift prediction using artificial neural networks. J. Biomol. NMR. 2003;26:25–37. doi: 10.1023/A:1023060720156.
48. Han B, Liu Y, Ginzinger SW, Wishart DS. SHIFTX2: significantly improved protein chemical shift prediction. J. Biomol. NMR. 2011;50:43–57. doi: 10.1007/s10858-011-9478-4.
49. Shen Y, Bax A. SPARTA+: a modest improvement in empirical NMR chemical shift prediction by means of an artificial neural network. J. Biomol. NMR. 2010;48:13–22. doi: 10.1007/s10858-010-9433-9.
50. Rupp M, Ramakrishnan R, von Lilienfeld OA. Machine learning for quantum mechanical properties of atoms in molecules. J. Phys. Chem. Lett. 2015;6:3309–3313. doi: 10.1021/acs.jpclett.5b01456.
51. Blinov K, et al. Performance validation of neural network based 13C NMR prediction using a publicly available data source. J. Chem. Inf. Model. 2008;48:550–555. doi: 10.1021/ci700363r.
52. Smurnyy YD, Blinov KA, Churanova TS, Elyashberg ME, Williams AJ. Toward more reliable 13C and 1H chemical shift prediction: a systematic comparison of neural-network and least-squares regression based approaches. J. Chem. Inf. Model. 2008;48:128–134. doi: 10.1021/ci700256n.
53. Aires-de-Sousa J, Hemmer MC, Gasteiger J. Prediction of 1H NMR chemical shifts using neural networks. Anal. Chem. 2002;74:80–90. doi: 10.1021/ac010737m.
54. Kuhn S, Egert B, Neumann S, Steinbeck C. Building blocks for automated elucidation of metabolites: machine learning methods for NMR prediction. BMC Bioinforma. 2008;9:400. doi: 10.1186/1471-2105-9-400.
55. Cuny J, Xie Y, Pickard CJ, Hassanali AA. Ab initio quality NMR parameters in solid-state materials using a high-dimensional neural-network representation. J. Chem. Theory Comput. 2016;12:765–773. doi: 10.1021/acs.jctc.5b01006.
56. Groom CR, Bruno IJ, Lightfoot MP, Ward SC. The Cambridge Structural Database. Acta Crystallogr. B. 2016;72:171–179. doi: 10.1107/S2052520616003954.
57. Hartman JD, Kudla RA, Day GM, Mueller LJ, Beran GJ. Benchmark fragment-based (1)H, (13)C, (15)N and (17)O chemical shift predictions in molecular crystals. Phys. Chem. Chem. Phys. 2016;18:21686–21709. doi: 10.1039/C6CP01831A.
58. Rasmussen CE, Williams CK. Gaussian Processes for Machine Learning. Vol. 1 (MIT Press, Cambridge, 2006).
59. Bartók AP, Kondor R, Csányi G. On representing chemical environments. Phys. Rev. B. 2013;87:1–16.
60. De S, Bartok AP, Csanyi G, Ceriotti M. Comparing molecules and solids across structural and alchemical space. Phys. Chem. Chem. Phys. 2016;18:13754–13769. doi: 10.1039/C6CP00415F.
61. Grisafi A, Wilkins DM, Csanyi G, Ceriotti M. Symmetry-adapted machine learning for tensorial properties of atomistic systems. Phys. Rev. Lett. 2018;120:036002. doi: 10.1103/PhysRevLett.120.036002.
62. Ceriotti M, Tribello GA, Parrinello M. Demonstrating the transferability and the descriptive power of sketch-map. J. Chem. Theory Comput. 2013;9:1521–1532. doi: 10.1021/ct3010563.
63. Campello RJGB, Moulavi D, Zimek A, Sander J. Hierarchical density estimates for data clustering, visualization, and outlier detection. ACM Trans. Knowl. Discov. Data. 2015;10:5. doi: 10.1145/2733381.
64. Giannozzi P, et al. Advanced capabilities for materials modelling with Quantum ESPRESSO. J. Phys. Condens. Matter. 2017;29:465901. doi: 10.1088/1361-648X/aa8f79.
65. Giannozzi P, et al. QUANTUM ESPRESSO: a modular and open-source software project for quantum simulations of materials. J. Phys. Condens. Matter. 2009;21:395502. doi: 10.1088/0953-8984/21/39/395502.
66. Varini N, Ceresoli D, Martin-Samos L, Girotto I, Cavazzoni C. Enhancement of DFT-calculations at petascale: nuclear magnetic resonance, hybrid density functional theory and Car–Parrinello calculations. Comput. Phys. Commun. 2013;184:1827–1833. doi: 10.1016/j.cpc.2013.03.003.
67. Clark SJ, et al. First principles methods using CASTEP. Z. Krist. Cryst. Mater. 2005;220:567–570.
68. Musil F, De S, Ceriotti M. Glosim2 package, https://github.com/cosmo-epfl/glosim2 (2017).
69. Arico-Muendel CC, et al. Orally active fumagillin analogues: transformations of a reactive warhead in the gastric environment. ACS Med. Chem. Lett. 2013;4:381–386. doi: 10.1021/ml3003633.
70. Dao HT, Li C, Michaudel Q, Maxwell BD, Baran PS. Hydromethylation of unactivated olefins. J. Am. Chem. Soc. 2015;137:8046–8049. doi: 10.1021/jacs.5b05144.
71. Garozzo D, et al. Inclusion networks of a calix[5]arene-based exoditopic receptor and long-chain alkyldiammonium ions. Org. Lett. 2003;5:4025–4028. doi: 10.1021/ol035310b.
72. Bats JW. CSD Commun. 2010.
73. Huang GB, et al. Selective recognition of aromatic hydrocarbons by endo-functionalized molecular tubes via C/N-H···π interactions. Chin. Chem. Lett. 2018;29:91–94. doi: 10.1016/j.cclet.2017.07.005.
74. Plater MJ, Harrison WA, Machado de los Toyos L, Hendry L. The consistent hexameric paddle-wheel crystallisation motif of a family of 2,4-bis(n-alkylamino)nitrobenzenes: alkyl=pentyl, hexyl, heptyl and octyl. J. Chem. Res. 2017;41:235–238. doi: 10.3184/174751917X14902201357356.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplementary Information^{(2.7MB, pdf)}

41467_2018_6972_MOESM2_ESM.pdf^{(63.7KB, pdf)}

Description of Additional Supplementary Files

Supplementary Data 1^{(9.4MB, txt)}

Supplementary Data 2^{(3.6MB, txt)}

Supplementary Data 3^{(14.1KB, zip)}

Supplementary Data 4^{(62.7KB, zip)}

Supplementary Data 5^{(26.5KB, zip)}

Supplementary Data 6^{(122.3KB, zip)}

Supplementary Data 7^{(18KB, zip)}

Supplementary Data 8^{(19.6KB, zip)}

Supplementary Data 9^{(143.4KB, zip)}

Data Availability Statement

[CR1] 1.Dedios AC, Pearson JG, Oldfield E. Secondary and tertiary structural effects on protein nmr chemical-shifts—an abinitio approach. Science. 1993;260:1491–1496. doi: 10.1126/science.8502992. [DOI] [PubMed] [Google Scholar]

[CR2] 2.Facelli JC, Grant DM. Determination of molecular symmetry in crystalline naphthalene using solid-state NMR. Nature. 1993;365:325–327. doi: 10.1038/365325a0. [DOI] [PubMed] [Google Scholar]

[CR3] 3.Sebastiani D, Parrinello M. A new ab-initio approach for NMR chemical shifts in periodic systems. J. Phys. Chem. A. 2001;105:1951–1958. doi: 10.1021/jp002807j. [DOI] [Google Scholar]

[CR4] 4.Pickard, C. J. & Mauri, F. All-electron magnetic response with pseudopotentials: NMR chemical shifts. Phys. Rev. B63, 245101 (2001).

[CR5] 5.Yates JR, Pickard CJ, Mauri F. Calculation of NMR chemical shifts for extended systems using ultrasoft pseudopotentials. Phys. Rev. B. 2007;76:024401. doi: 10.1103/PhysRevB.76.024401. [DOI] [Google Scholar]

[CR6] 6.Blochl PE. Projector augmented-wave method. Phys. Rev. B Condens. Matter Mater. Phys. 1994;50:17953–17979. doi: 10.1103/PhysRevB.50.17953. [DOI] [PubMed] [Google Scholar]

[CR7] 7.Ochsenfeld C, Brown SP, Schnell I, Gauss J, Spiess HW. Structure assignment in the solid state by the coupling of quantum chemical calculations with NMR experiments: a columnar hexabenzocoronene derivative. J. Am. Chem. Soc. 2001;123:2597–2606. doi: 10.1021/ja0021823. [DOI] [PubMed] [Google Scholar]

[CR8] 8.Harris RK. NMR crystallography: the use of chemical shifts. Solid State Sci. 2004;6:1025–1037. doi: 10.1016/j.solidstatesciences.2004.03.040. [DOI] [Google Scholar]

[CR9] 9.Harper JK, Grant DM. Enhancing crystal-structure prediction with NMR tensor data. Cryst. Growth Des. 2006;6:2315–2321. doi: 10.1021/cg060244g. [DOI] [Google Scholar]

[CR10] 10.Harris RK. Applications of solid-state NMR to pharmaceutical polymorphism and related matters. J. Pharm. Pharmacol. 2007;59:225–239. doi: 10.1211/jpp.59.2.0009. [DOI] [PubMed] [Google Scholar]

[CR11] 11.Othman A, Evans JS, Evans IR, Harris RK, Hodgkinson P. Structural study of polymorphs and solvates of finasteride. J. Pharm. Sci. 2007;96:1380–1397. doi: 10.1002/jps.20940. [DOI] [PubMed] [Google Scholar]

[CR12] 12.Salager E, Stein RS, Pickard CJ, Elena B, Emsley L. Powder NMR crystallography of thymol. Phys. Chem. Chem. Phys. 2009;11:2610–2621. doi: 10.1039/b821018g. [DOI] [PubMed] [Google Scholar]

[CR13] 13.Salager E, et al. Powder crystallography by combined crystal structure prediction and high-resolution 1H solid-state NMR spectroscopy. J. Am. Chem. Soc. 2010;132:2564–2566. doi: 10.1021/ja909449k. [DOI] [PubMed] [Google Scholar]

[CR14] 14.Webber AL, Emsley L, Claramunt RM, Brown SP. NMR crystallography of campho[2,3-c]pyrazole (Z’ = 6): combining high-resolution 1H-13C solid-state MAS NMR spectroscopy and GIPAW chemical-shift calculations. J. Phys. Chem. A. 2010;114:10435–10442. doi: 10.1021/jp104901j. [DOI] [PubMed] [Google Scholar]

[CR15] 15.Dudenko D, et al. A strategy for revealing the packing in semicrystalline pi-conjugated polymers: crystal structure of bulk poly-3-hexyl-thiophene (P3HT) Angew. Chem. Int. Ed. Engl. 2012;51:11068–11072. doi: 10.1002/anie.201205075. [DOI] [PubMed] [Google Scholar]

[CR16] 16.Baias M, et al. Powder crystallography of pharmaceutical materials by combined crystal structure prediction and solid-state 1H NMR spectroscopy. Phys. Chem. Chem. Phys. 2013;15:8069–8080. doi: 10.1039/c3cp41095a. [DOI] [PubMed] [Google Scholar]

[CR17] 17.Pawlak T, Jaworska M, Potrzebowski MJ. NMR crystallography of alpha-poly(l-lactide) Phys. Chem. Chem. Phys. 2013;15:3137–3145. doi: 10.1039/c2cp43174b. [DOI] [PubMed] [Google Scholar]

[CR18] 18.Santos SM, Rocha J, Mafra L. NMR crystallography: toward chemical shift-driven crystal structure determination of the beta-lactam antibiotic amoxicillin trihydrate. Cryst. Growth Des. 2013;13:2390–2395. doi: 10.1021/cg4002785. [DOI] [Google Scholar]

[CR19] 19.Ludeker D, Brunklaus G. NMR crystallography of ezetimibe co-crystals. Solid. State Nucl. Magn. Reson. 2015;65:29–40. doi: 10.1016/j.ssnmr.2014.11.002. [DOI] [PubMed] [Google Scholar]

[CR20] 20.Paluch P, Pawlak T, Oszajca M, Lasocha W, Potrzebowski MJ. Fine refinement of solid state structure of racemic form of phospho-tyrosine employing NMR crystallography approach. Solid. State Nucl. Magn. Reson. 2015;65:2–11. doi: 10.1016/j.ssnmr.2014.08.002. [DOI] [PubMed] [Google Scholar]

[CR21] 21.Watts AE, Maruyoshi K, Hughes CE, Brown SP, Harris KDM. Combining the advantages of powder X-ray diffraction and NMR crystallography in structure determination of the pharmaceutical material cimetidine hydrochloride. Cryst. Growth Des. 2016;16:1798–1804. doi: 10.1021/acs.cgd.6b00016. [DOI] [Google Scholar]

[CR22] 22.Widdifield CM, Robson H, Hodgkinson P. Furosemide’s one little hydrogen atom: NMR crystallography structure verification of powdered molecular organics. Chem. Commun. 2016;52:6685–6688. doi: 10.1039/C6CC02171A. [DOI] [PubMed] [Google Scholar]

[CR23] 23.Mali G. Ab initio crystal structure prediction of magnesium (poly)sulfides and calculation of their NMR parameters. Acta Crystallogr. Sect. C Struct. Chem. 2017;73:229–233. doi: 10.1107/S2053229617000687. [DOI] [PubMed] [Google Scholar]

[CR24] 24.Harris RK, Joyce SA, Pickard CJ, Cadars S, Emsley L. Assigning carbon-13 NMR spectra to crystal structures by the INADEQUATE pulse sequence and first principles computation: a case study of two forms of testosterone. Phys. Chem. Chem. Phys. 2006;8:137–143. doi: 10.1039/B513392K. [DOI] [PubMed] [Google Scholar]

[CR25] 25.Mifsud N, Elena B, Pickard CJ, Lesage A, Emsley L. Assigning powders to crystal structures by high-resolution (1)H-(1)H double quantum and (1)H-(13)C J-INEPT solid-state NMR spectroscopy and first principles computation. A case study of penicillin G. Phys. Chem. Chem. Phys. 2006;8:3418–3422. doi: 10.1039/B605227D. [DOI] [PubMed] [Google Scholar]

[CR26] 26.Heider EM, Harper JK, Grant DM. Structural characterization of an anhydrous polymorph of paclitaxel by solid-state NMR. Phys. Chem. Chem. Phys. 2007;9:6083–6097. doi: 10.1039/b711027h. [DOI] [PubMed] [Google Scholar]

[CR27] 27.Baias M, et al. De novo determination of the crystal structure of a large drug molecule by crystal structure prediction-based powder NMR crystallography. J. Am. Chem. Soc. 2013;135:17501–17507. doi: 10.1021/ja4088874. [DOI] [PubMed] [Google Scholar]

[CR28] 28.Fernandes JA, Sardo M, Mafra L, Choquesillo-Lazarte D, Masciocchi N. X-ray and NMR crystallography studies of novel theophylline cocrystals prepared by liquid assisted grinding. Cryst. Growth Des. 2015;15:3674–3683. doi: 10.1021/acs.cgd.5b00279. [DOI] [Google Scholar]

[CR29] 29.Leclaire J, et al. Structure elucidation of a complex CO2-based organic framework material by NMR crystallography. Chem. Sci. 2016;7:4379–4390. doi: 10.1039/C5SC03810C. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR30] 30.Selent M, et al. Clathrate structure determination by combining crystal structure prediction with computational and experimental (129) Xe NMR spectroscopy. Chemistry. 2017;23:5258–5269. doi: 10.1002/chem.201604797. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR31] 31.Widdifield CM, et al. Does Z’ equal 1 or 2? Enhanced powder NMR crystallography verification of a disordered room temperature crystal structure of a p38 inhibitor for chronic obstructive pulmonary disease. Phys. Chem. Chem. Phys. 2017;19:16650–16661. doi: 10.1039/C7CP02349A. [DOI] [PubMed] [Google Scholar]

[CR32] 32.Nilsson Lill SO, et al. Elucidating an amorphous form stabilization mechanism for tenapanor hydrochloride: crystal structure analysis using X-ray diffraction, NMR crystallography, and molecular modeling. Mol. Pharm. 2018;15:1476–1487. doi: 10.1021/acs.molpharmaceut.7b01047. [DOI] [PubMed] [Google Scholar]

[CR33] 33.Hofstetter A, Emsley L. Positional variance in NMR crystallography. J. Am. Chem. Soc. 2017;139:2573–2576. doi: 10.1021/jacs.6b12705. [DOI] [PubMed] [Google Scholar]

[CR34] 34.Curtarolo S, et al. The high-throughput highway to computational materials design. Nat. Mater. 2013;12:191–201. doi: 10.1038/nmat3568. [DOI] [PubMed] [Google Scholar]

[CR35] 35.Bartok AP, et al. Machine learning unifies the modeling of materials and molecules. Sci. Adv. 2017;3:e1701816. doi: 10.1126/sciadv.1701816. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR36] 36.Xue D, et al. Accelerated search for materials with targeted properties by adaptive design. Nat. Commun. 2016;7:11241. doi: 10.1038/ncomms11241. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR37] 37.Ward, L., Agrawal, A., Choudhary, A. & Wolverton, C. A general-purpose machine learning framework for predicting properties of inorganic materials. npj Comput. Mater. 2, 16028 (2016).

[CR38] 38.Rupp M, Tkatchenko A, Muller KR, von Lilienfeld OA. Fast and accurate modeling of molecular atomization energies with machine learning. Phys. Rev. Lett. 2012;108:058301. doi: 10.1103/PhysRevLett.108.058301. [DOI] [PubMed] [Google Scholar]

[CR39] 39.Shen Y, Bax A. Protein backbone chemical shifts predicted from searching a database for torsion angle and sequence homology. J. Biomol. NMR. 2007;38:289–302. doi: 10.1007/s10858-007-9166-6. [DOI] [PubMed] [Google Scholar]

[CR40] 40.Neal S, Nip AM, Zhang HY, Wishart DS. Rapid and accurate calculation of protein H-1, C-13 and N-15 chemical shifts. J. Biomol. NMR. 2003;26:215–240. doi: 10.1023/A:1023812930288. [DOI] [PubMed] [Google Scholar]

[CR41] 41.Wishart DS, Watson MS, Boyko RF, Sykes BD. Automated H-1 and C-13 chemical shift prediction using the BioMagResBank. J. Biomol. NMR. 1997;10:329–336. doi: 10.1023/A:1018373822088. [DOI] [PubMed] [Google Scholar]

[CR42] 42.Iwadate M, Asakura T, Williamson MP. C alpha and C beta carbon-13 chemical shifts in proteins from an empirical database. J. Biomol. NMR. 1999;13:199–211. doi: 10.1023/A:1008376710086. [DOI] [PubMed] [Google Scholar]

[CR43] 43.Xu XP, Case DA. Automated prediction of (15)N, (13)C(alpha), (13)C(beta) and (13)C‘ chemical shifts in proteins using a density functional database. J. Biomol. NMR. 2001;21:321–333. doi: 10.1023/A:1013324104681. [DOI] [PubMed] [Google Scholar]

[CR44] 44.Moon S, Case DA. A new model for chemical shifts of amide hydrogens in proteins. J. Biomol. NMR. 2007;38:139–150. doi: 10.1007/s10858-007-9156-8. [DOI] [PubMed] [Google Scholar]

[CR45] 45.Vila JA, Arnautova YA, Martin OA, Scheraga HA. Quantum-mechanics-derived 13Calpha chemical shift server (CheShift) for protein structure validation. Proc. Natl Acad. Sci. USA. 2009;106:16972–16977. doi: 10.1073/pnas.0908833106. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR46] 46. Kohlhoff KJ, Robustelli P, Cavalli A, Salvatella X, Vendruscolo M. Fast and accurate predictions of protein NMR chemical shifts from interatomic distances. J. Am. Chem. Soc. 2009;131:13894–13895. doi: 10.1021/ja903772t.

[CR47] 47. Meiler J. PROSHIFT: protein chemical shift prediction using artificial neural networks. J. Biomol. NMR. 2003;26:25–37. doi: 10.1023/A:1023060720156.

[CR48] 48. Han B, Liu Y, Ginzinger SW, Wishart DS. SHIFTX2: significantly improved protein chemical shift prediction. J. Biomol. NMR. 2011;50:43–57. doi: 10.1007/s10858-011-9478-4.

[CR49] 49. Shen Y, Bax A. SPARTA+: a modest improvement in empirical NMR chemical shift prediction by means of an artificial neural network. J. Biomol. NMR. 2010;48:13–22. doi: 10.1007/s10858-010-9433-9.

[CR50] 50. Rupp M, Ramakrishnan R, von Lilienfeld OA. Machine learning for quantum mechanical properties of atoms in molecules. J. Phys. Chem. Lett. 2015;6:3309–3313. doi: 10.1021/acs.jpclett.5b01456.

[CR51] 51. Blinov K, et al. Performance validation of neural network based 13C NMR prediction using a publicly available data source. J. Chem. Inf. Model. 2008;48:550–555. doi: 10.1021/ci700363r.

[CR52] 52. Smurnyy YD, Blinov KA, Churanova TS, Elyashberg ME, Williams AJ. Toward more reliable 13C and 1H chemical shift prediction: a systematic comparison of neural-network and least-squares regression based approaches. J. Chem. Inf. Model. 2008;48:128–134. doi: 10.1021/ci700256n.

[CR53] 53. Aires-de-Sousa J, Hemmer MC, Gasteiger J. Prediction of 1H NMR chemical shifts using neural networks. Anal. Chem. 2002;74:80–90. doi: 10.1021/ac010737m.

[CR54] 54. Kuhn S, Egert B, Neumann S, Steinbeck C. Building blocks for automated elucidation of metabolites: machine learning methods for NMR prediction. BMC Bioinforma. 2008;9:400. doi: 10.1186/1471-2105-9-400.

[CR55] 55. Cuny J, Xie Y, Pickard CJ, Hassanali AA. Ab initio quality NMR parameters in solid-state materials using a high-dimensional neural-network representation. J. Chem. Theory Comput. 2016;12:765–773. doi: 10.1021/acs.jctc.5b01006.

[CR56] 56. Groom CR, Bruno IJ, Lightfoot MP, Ward SC. The Cambridge Structural Database. Acta Crystallogr. B. 2016;72:171–179. doi: 10.1107/S2052520616003954.

[CR57] 57. Hartman JD, Kudla RA, Day GM, Mueller LJ, Beran GJ. Benchmark fragment-based 1H, 13C, 15N and 17O chemical shift predictions in molecular crystals. Phys. Chem. Chem. Phys. 2016;18:21686–21709. doi: 10.1039/C6CP01831A.

[CR58] 58. Rasmussen CE, Williams CK. Gaussian Processes for Machine Learning. Vol. 1 (MIT Press, Cambridge, 2006).

[CR59] 59. Bartók AP, Kondor R, Csányi G. On representing chemical environments. Phys. Rev. B. 2013;87:1–16.

[CR60] 60. De S, Bartók AP, Csányi G, Ceriotti M. Comparing molecules and solids across structural and alchemical space. Phys. Chem. Chem. Phys. 2016;18:13754–13769. doi: 10.1039/C6CP00415F.

[CR61] 61. Grisafi A, Wilkins DM, Csányi G, Ceriotti M. Symmetry-adapted machine learning for tensorial properties of atomistic systems. Phys. Rev. Lett. 2018;120:036002. doi: 10.1103/PhysRevLett.120.036002.

[CR62] 62. Ceriotti M, Tribello GA, Parrinello M. Demonstrating the transferability and the descriptive power of sketch-map. J. Chem. Theory Comput. 2013;9:1521–1532. doi: 10.1021/ct3010563.

[CR63] 63. Campello RJGB, Moulavi D, Zimek A, Sander J. Hierarchical density estimates for data clustering, visualization, and outlier detection. ACM Trans. Knowl. Discov. Data. 2015;10:5. doi: 10.1145/2733381.

[CR64] 64. Giannozzi P, et al. Advanced capabilities for materials modelling with Quantum ESPRESSO. J. Phys. Condens. Matter. 2017;29:465901. doi: 10.1088/1361-648X/aa8f79.

[CR65] 65. Giannozzi P, et al. QUANTUM ESPRESSO: a modular and open-source software project for quantum simulations of materials. J. Phys. Condens. Matter. 2009;21:395502. doi: 10.1088/0953-8984/21/39/395502.

[CR66] 66. Varini N, Ceresoli D, Martin-Samos L, Girotto I, Cavazzoni C. Enhancement of DFT-calculations at petascale: nuclear magnetic resonance, hybrid density functional theory and Car–Parrinello calculations. Comput. Phys. Commun. 2013;184:1827–1833. doi: 10.1016/j.cpc.2013.03.003.

[CR67] 67. Clark SJ, et al. First principles methods using CASTEP. Z. Krist. Cryst. Mater. 2005;220:567–570.

[CR68] 68. Musil F, De S, Ceriotti M. Glosim2 package, https://github.com/cosmo-epfl/glosim2 (2017).

[CR69] 69. Arico-Muendel CC, et al. Orally active fumagillin analogues: transformations of a reactive warhead in the gastric environment. ACS Med. Chem. Lett. 2013;4:381–386. doi: 10.1021/ml3003633.

[CR70] 70. Dao HT, Li C, Michaudel Q, Maxwell BD, Baran PS. Hydromethylation of unactivated olefins. J. Am. Chem. Soc. 2015;137:8046–8049. doi: 10.1021/jacs.5b05144.

[CR71] 71. Garozzo D, et al. Inclusion networks of a calix[5]arene-based exoditopic receptor and long-chain alkyldiammonium ions. Org. Lett. 2003;5:4025–4028. doi: 10.1021/ol035310b.

[CR72] 72. Bats JW. CSD Commun. (2010).

[CR73] 73. Huang GB, et al. Selective recognition of aromatic hydrocarbons by endo-functionalized molecular tubes via C/N-H···π interactions. Chin. Chem. Lett. 2018;29:91–94. doi: 10.1016/j.cclet.2017.07.005.

[CR74] 74. Plater MJ, Harrison WA, Machado de los Toyos L, Hendry L. The consistent hexameric paddle-wheel crystallisation motif of a family of 2,4-bis(n-alkylamino)nitrobenzenes: alkyl=pentyl, hexyl, heptyl and octyl. J. Chem. Res. 2017;41:235–238. doi: 10.3184/174751917X14902201357356.
