Abstract
Organic crystal structures exert a profound impact on the physicochemical properties and biological effects of organic compounds. Quantum mechanics (QM)–based crystal structure prediction (CSP) has partly compensated for the difficulty of completing polymorphism studies by experiment alone, but its high computing cost hinders widespread application. The present study aims to construct DeepCSP, a feasible pure machine learning framework for minute-scale rapid organic CSP. First, based on 177,746 data entries from the Cambridge Structural Database, a generative adversarial network was built to conditionally generate trial crystal structures under selected feature constraints for a given molecule. In parallel, a graph convolutional attention network was used to predict the density of stable crystal structures for the input molecule. The distance between the predicted density and the density calculated by definition from each trial structure then serves as the basis for screening and ranking, and a density-based crystal structure ranking is output. These two distinct algorithms, performing generation and ranking, respectively, together constitute DeepCSP, which demonstrated compelling performance in marketed drug validations, achieving an accuracy exceeding 80% and a hit rate surpassing 85%. The computing speed of this pure machine learning methodology demonstrates the potential of artificial intelligence in advancing CSP research.
Public summary
• The diverse crystal structures of organic compounds significantly influence their properties across fields, from optoelectronics to superconductors and from drugs to high-energy explosives.
• Traditional methods face limitations in efficiency due to incomplete experimental crystallization and time-consuming quantum mechanics calculations.
• DeepCSP leverages AI to achieve minute-scale predictions of organic crystal structures from two-dimensional molecular structures.
Introduction
Organic crystal structures, among the most closely studied solid-state characteristics, describe the microscopic molecular packing of solid crystalline organic compounds.1 Crystal structures directly or indirectly influence the physicochemical properties and biological effects of solid matter, including density, solubility, melting point, bioavailability, vibrational transitions, and electronic transitions.1 Crystal structure research has become an indispensable core component in diverse domains such as drug research and development (R&D), novel material exploration, electronics and photonics research, and high-energy explosives discovery. In drug R&D, for example, the crystalline form of a drug can affect its solid-state properties, solubility, hygroscopicity, stability, and bioavailability,2 and changes in these properties significantly affect the safety, efficacy, and manufacturability of pharmaceutical products. In 1832, polymorphism in molecular crystal structures was first observed in research on benzamide,3 revealing that some compounds can pack into different crystal structures.4 Since then, an increasing number of compounds have been identified as possessing multiple stable crystal structures,5 suggesting that polymorphism is widespread. In 2018, Neumann and van de Streek6 concluded that approximately 15%–45% of marketed small-molecule drugs have more stable crystal structures that remain unidentified. Determining the possibility of undesirable polymorphism has gradually become one of the central foci of crystal structure studies. Incomplete drug crystal structure studies may lead to catastrophic risks, exemplified by the case of ritonavir, an antiviral drug.
After ritonavir was launched in 1996, the discovery of a less soluble crystal form led to a recall that directly resulted in a loss of over $250 million.6,7 In addition, in-depth crystallographic studies are expected to predict or design new solid-state forms, providing opportunities to acquire new intellectual property,8 establish high-technology thresholds, and extend the life cycle of pharmaceuticals.2,4
Although crystal structure research is important and presents opportunities, it is not without challenges. Traditional experimental crystal structure screening has long remained a trial-and-error procedure. Extensive labor, material, and time are invested in screening different crystallization conditions and then characterizing the resulting structures by spectroscopic and thermal analysis.9,10 Worse still, the endpoint of screening is uncertain.11 According to studies of nucleation kinetics,12 a compound may first crystallize into a thermodynamically labile structure and then fail to transform into the thermodynamically stable structure because of high nucleation barriers. Such a labile structure may persist until a specific crystallization condition is met, possibly as late as industrial production, at which point it unexpectedly transforms into a more stable crystal structure, which is unacceptable. Therefore, neither traditional crystallographic screening methods (e.g., recrystallization, rapid solvent removal, lattice disruption) nor tailored, more advanced methods, including freeze-drying,13 high-pressure crystallization,12 capillary crystallization,14 laser-induced nucleation,15 crystallization under external magnetic fields,16 and the use of supercritical antisolvents,17 can guarantee that the right conditions are found to produce the stable crystalline form with the lowest free energy during a short experimental screening campaign. Moreover, there is no assurance that all of the potential crystalline forms of a given compound will be found.
To address the challenges associated with experimental crystal structure screening, computational crystal structure prediction (CSP) is a promising direction that escapes the limitations of crystallization conditions. The ultimate goal of CSP is to explore all of the stable crystal structures starting from two-dimensional (2D) molecular structures. CSP is based on the assumption that a thermodynamically stable crystal structure possesses relatively lower energy.18 The traditional approach to structure generation in CSP relies on extensive sampling of trial crystal structures, and the indispensable optimization of each trial structure demands expensive computing power. As for energy-based ranking, lattice energy calculations guided by quantum mechanics (QM) theory are currently the mainstream of CSP, including density functional theory; the Hartree-Fock method; coupled cluster singles, doubles, and perturbative triples theory; and so on.19 In recent years, computing power has improved spectacularly, but it still struggles to support expensive dynamic ab initio free energy calculations. Most existing CSPs have to replace the free energy with the lattice energy calculated under nonambient conditions,18 yet still require months to complete an all-atom lattice energy calculation for a specific molecule. The possibly incomplete trial space and the lengthy calculation cycle are the most significant limitations of QM-based CSP applications.
Improving structure generation efficiency and reducing computing costs are the evolutionary directions for CSP. Recently, data-driven machine learning has been a powerful force in chemistry, materials science, and drug discovery and development.20,21,22 In general, well-trained machine learning models exhibit linear inference complexity, whereas ab initio CSPs, whose computation is based on explicit physical models, have much higher computational complexity. In this light, machine learning algorithms may be a solution to the pain point of QM-based CSPs. The Cambridge Structural Database (CSD),23 an open-source database established in 1965, has integrated over 1 million high-quality crystal structure entries, offering rich possibilities for machine learning–driven CSP. Several studies have attempted to apply machine learning methods to inorganic CSP.24,25,26 As an example, Kim et al.27 adopted an atomic coordinate crystal representation to establish a generative model for Mg-Mn-O ternary inorganic materials, and the photoanodic properties, stability, and band gap of the materials were computationally predicted. Long’s team24 developed the constrained crystals deep convolutional generative adversarial network model to optimize attributes in the latent space and internalize the desired attributes as targets of the generator using conditional generation as a constraint, which was successfully applied to the BiSe binary system. However, compared with inorganic materials with highly regular crystal structures, the much larger number of atoms and highly complex chemical environment of organics make machine learning CSP far more difficult. To the best of our knowledge, there are no reports that rely exclusively on machine learning algorithms for the CSP of organic compounds.
The generation and ranking of trial crystal structures are the two most essential subjects of CSP. Generative adversarial networks (GANs), a widely used generative learning algorithm proposed in 2014,28 may offer fresh opportunities for crystal structure generation. As for ranking, based on the consensus that lower-energy structures are more stable, existing CSPs rely on QM energy calculations, which are the rate-limiting step. It is natural to try to predict lattice or free energy with machine learning algorithms; however, the shortage of data is a fatal limitation, because every data point is expensive to generate. Could a more readily available crystal property serve as the basis for ranking instead? In 1979, the crystallographer A. Burger proposed the density rule29 based on statistical mechanics, which states that if one modification of a molecular crystal has a lower density than the other, it may be assumed to be less stable at absolute zero; the rule has been verified experimentally.30 The energy landscape, one of the important outputs of CSP, is a scatterplot of potential crystal structures plotted by energy and density. The density rule and energy landscapes suggest a possible correlation between the stability of a crystal structure and its density, and the density rule29 goes further in hypothesizing that the higher the density, the more stable the structure. This suggests that if machine learning models can accurately predict the density of the most stable crystal structures, then the inexpensive predicted density may be used as the ranking basis for finding stable crystal structures, offering an alternative to the costly calculated free energy.
Following the above idea, the present study aims to develop a pure machine learning organic CSP framework called DeepCSP (Figure 1). A whole dataset for crystal structures of free-form (single component) organic molecules was obtained from CSD, which contains 177,746 data entries. Based on this dataset, two machine learning models with different functions were successfully constructed: the organic compound crystal structure conditional GAN (OCGAN) and the molecular graph convolutional network with the attention mechanism (MolGAT). OCGAN considered the structural characteristics of the input molecules and conditionally generated the potential crystal structures from the learned space (characterized by crystal structure parameters). Meanwhile, MolGAT was implemented to predict the density of stable crystal structures at room temperature. The graph network relies on the input of a 2D molecular map for target property prediction through a message passing algorithm, which can achieve high prediction accuracy while eliminating complex feature engineering. Next, the predicted density served as the basis for the trial crystal structure screening and ranking, resulting in a recommended crystal structure ranking map. After that, an additional crystal structure dataset of 423 marketed drugs was used to evaluate the overall performance of DeepCSP. Finally, several case studies (supplemental information) were used to showcase the functionalities and performance of DeepCSP.
Figure 1.
Overview of DeepCSP, the pure machine learning organic compound CSP framework
(A) Representation of crystal structures. Each crystal structure is described by its space group (one of 11 major space groups, which account for more than 90% of the structures across all 230 space groups), 6 cell parameters (a, b, c, α, β, γ), and the z value.
(B) Modeling dataset building process.
(C) Molecular graph representation of organic compounds in GCNs.
(D) Framework of OCGAN (D1) and algorithms for crystal structure filtering and ranking (D2).
(E) Schematic diagram of crystal structure space filtering. Please see the supplemental information for calculation details.
(F) Left: structure of MolGAT for molecular stable crystal structure density prediction. Right: schematic diagram of the algorithms for graph convolution (top), the attention mechanism (center), and dense layers (bottom).
The generation and ranking of crystal structures are the two most important components of CSP, and they are closely intertwined, mutually supportive, and complementary. OCGAN and MolGAT operate within these two stages, and their coupling enables them to collaborate effectively within the DeepCSP framework, resulting in a synergistic effect and achieving highly accurate predictions of crystal structures. Altogether, DeepCSP, a pure machine learning framework, presents a fresh perspective on CSP, which reduces the computing cost by tens of thousands of times while maintaining the comparable capacity to explore potentially stable organic crystalline forms as ab initio CSP methods, demonstrating the great potential of the new paradigm of machine learning–driven CSP.
Results
Datasets
Whole dataset
This study began with free-form organic compounds. A raw dataset containing the target objects was obtained from the CSD. The front-end software package ConQuest31 was used to search the target crystal structures in CSD according to chemical elements, chemical names, molecular structures, or substructures, which led to the retrieval of an original dataset containing 257,029 data entries. Data preprocessing was then performed on the original dataset with the aim of addressing class imbalances in discrete data and distribution unevenness in continuous data. The detailed preprocessing workflow is described in the supplemental information. The resulting whole dataset for modeling contains 177,746 crystal structure entries. The calculated density distribution is shown in Figure 2A.
Figure 2.
A comprehensive overview of the dataset and feature screening results
(A) Calculated density distribution of all of the compounds in the whole dataset.
(B) Publication date distribution of all of the compounds in the whole dataset.
(C) Conditional features screening results (using accuracy as the evaluation metric).
(D) Conditional features screening results (using hit rate as the evaluation metric).
(E) Conditional feature combination screening results.
(F) The Δdensity single-factor screening result.
Data splitting
In the modeling of MolGAT, the whole dataset was split according to the three-subset splitting strategy, which is broadly accepted in machine learning—in other words, training models on the training subset (142,196 entries, 80%), tuning the model hyperparameters on the validation subset (17,775 entries, 10%), and testing the model performance on the test subset (17,775 entries, 10%). A time-based stratified sampling algorithm was applied for data splitting to avoid ignoring the small but widely validated crystal structure data published earlier (see Figure 2B).
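The splitting idea can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes that "time-based stratified sampling" means grouping entries by publication period and splitting each group 80/10/10, so that older entries appear in every subset. The function name, the 10-year bin width, and the `year_of` accessor are hypothetical.

```python
import random
from collections import defaultdict

def time_stratified_split(entries, year_of, fractions=(0.8, 0.1, 0.1),
                          bin_width=10, seed=0):
    """Split entries into train/validation/test subsets so that every
    publication-period bin contributes to each subset proportionally."""
    rng = random.Random(seed)
    bins = defaultdict(list)
    for entry in entries:
        # Assumption: strata are fixed-width publication-year bins.
        bins[year_of(entry) // bin_width].append(entry)

    train, valid, test = [], [], []
    for _, members in sorted(bins.items()):
        rng.shuffle(members)
        n_train = round(len(members) * fractions[0])
        n_valid = round(len(members) * fractions[1])
        train.extend(members[:n_train])
        valid.extend(members[n_train:n_train + n_valid])
        test.extend(members[n_train + n_valid:])  # remainder goes to test
    return train, valid, test
```

Because the split is done per bin, sparsely populated early periods are not lost to a single subset, which is the motivation stated above.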
Additional test dataset
Acquired from the marketed drug data subset of CSD, the additional test dataset for the overall performance evaluation of the CSP machine learning framework consisted of 423 marketed drugs with 936 crystal structures in total, 264 drugs of which were monomorph and 159 were polymorph.
OCGAN
Sample representation
The application of GAN to generate crystal structures presupposes that crystal structure can be reasonably expressed in mathematical form. Eight crystal structure parameters are used to represent a unique crystal structure: space group, z value, three lattice lengths (a, b, c), and three lattice angles (α, β, γ), where the space group describes the lattice symmetry, the z value characterizes the number of molecules in the lattice, and the six lattice parameters describe the physical dimensions and geometry to explore the degrees of freedom of the unit cell.
The nuclear charge properties, the 3D conformation, and the shape of molecules are closely associated with the lattice stacking pattern; therefore, selected molecular properties are used as conditional constraints of the GAN to guide crystal structure generation, known as a CGAN.
Features
The CSD Python API was used to retrieve the simplified molecular-input line-entry system (SMILES)32 string of each molecule from its Refcode (the unique identifier of the crystal structure in CSD). With SMILES as input, the molecular mass and other molecular features were calculated with the RDKit33 and OpenBabel34 Python packages, including descriptors such as 'S_L,' 'M_L,' 'S_M,' 'S,' 'Globularity,' 'FrTPSA,' and 'Dipole_moment.' These features represent a priori knowledge because previous studies have shown them to be highly correlated with molecular crystallization.35 The molecular features are described in Table 1.
Table 1.
Molecular feature conditions applied in OCGAN
| Feature | Description | Categories no. |
|---|---|---|
| MW | Molecular weight | 7 |
| S | Short axis of an enclosing box | 10 |
| S_L | Short/long axis of an enclosing box | 6 |
| S_M | Short/medium axis of an enclosing box | 6 |
| M_L | Medium/long axis of an enclosing box | 4 |
| Globularity | Surface of a sphere with the same volume as the molecule/solvent-accessible surface area | 6 |
| FrTPSA | Topological polar surface area (TPSA)/solvent-accessible surface area (SASA) | 5 |
| Dipole_moment | Dipole moment | 5 |
Enclosing box: a closed box whose outer boundary encapsulates the actual molecule.
In OCGAN, the conditional features are not assigned as continuous values, but instead are divided into categorical numbers, as shown in Table 1, leading to one-hot features. As shown in Figure 2, we initially used separate conditional features as constraints for structure generation (Figures 2C and 2D), and 'MW,' 'S_L,' and 'Dipole_moment' performed best in their respective categories in terms of both accuracy and hit rate. Subsequently, the combination of different classes of conditional features was evaluated (Figure 2E), and ultimately, the combination of 'MW,' 'S_L,' and 'Dipole_moment' was selected as the post hoc conditions, which represent the molecular size, 3D shape, and charge characteristics of a given compound, respectively.
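The discretization into one-hot condition vectors can be sketched as follows. This is an illustration only: the bin edges in `MW_EDGES` are hypothetical placeholders chosen to yield the 7 molecular weight categories listed in Table 1, and the actual category boundaries are not given in this section.

```python
from bisect import bisect_right

def to_category(value, edges):
    """Return the index of the bin that `value` falls into, given sorted inner edges."""
    return bisect_right(edges, value)

def one_hot(index, n_categories):
    """Encode a category index as a one-hot list."""
    vec = [0] * n_categories
    vec[index] = 1
    return vec

# Hypothetical bin edges (NOT from the paper): 6 inner edges give 7 categories.
MW_EDGES = [150.0, 200.0, 250.0, 300.0, 400.0, 500.0]

def encode_mw(mw):
    """Map a continuous molecular weight to a 7-dimensional one-hot condition."""
    return one_hot(to_category(mw, MW_EDGES), len(MW_EDGES) + 1)
```

The same two helpers would apply to 'S_L' and 'Dipole_moment' with their own edges, and the three one-hot vectors could then be concatenated into a single condition vector.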
Moreover, modeling of the three continuous columns of lattice angles, α, β, and γ, can be a challenge. An additional Boolean column was added for each angle parameter, which returned 90° for true and the specific non-90° for false. Such a transition could guide the model to learn first the Boolean discrete values and then the mixed Gaussian continuous features for non-90° data. Through the combination of the two distributions, the model is able to learn the highly unbalanced angle distributions more easily.
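A minimal sketch of this angle encoding is given below, assuming a small numerical tolerance (`tol`, a hypothetical parameter) decides whether an angle counts as exactly 90°.

```python
def encode_angle(angle_deg, tol=1e-6):
    """Split a lattice angle into a Boolean 'is 90 degrees' flag and a value.
    The flag is the discrete part; the value carries the specific non-90 angle."""
    is_90 = abs(angle_deg - 90.0) <= tol
    return is_90, (90.0 if is_90 else angle_deg)

def decode_angle(is_90, value):
    """Recover the lattice angle from the (flag, value) pair."""
    return 90.0 if is_90 else value
```

Separating the flag from the value lets a model fit the dominant 90° spike as a categorical variable and reserve the continuous distribution for the remaining angles.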
Model construction
In this study, we report a CGAN model for generating trial crystal structures that are represented as crystal structure vectors and stored in tabular format. See the supplemental information for details.
Figure 1D1 demonstrates the scheme of OCGAN. The generator and the critic competed against each other, whereas the parameters were tuned continuously and Nash equilibrium was eventually reached. As a result, the generator would be able to fool the critic by generating realistic candidates that are indistinguishable to the critic. Conditional generation with selected molecular properties as constraints was achieved by the rejection sampling process (i.e., repeated sampling until a qualified row was met), thus ensuring that the mandate of conditional generation is achieved.
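The rejection sampling step can be sketched generically as below. The toy generator is a stand-in for the trained OCGAN generator, and the names `sample_with_condition`, `make_toy_generator`, and `max_tries` are hypothetical.

```python
import random

def sample_with_condition(generate_row, condition, max_tries=10_000):
    """Draw rows from `generate_row` until one matches every key in `condition`."""
    for _ in range(max_tries):
        row = generate_row()
        if all(row.get(key) == value for key, value in condition.items()):
            return row
    raise RuntimeError("no qualifying row found within max_tries")

def make_toy_generator(seed=0):
    """Stand-in for a trained generator: emits random rows with one
    categorical condition column and one continuous lattice length."""
    rng = random.Random(seed)
    def generate_row():
        return {"MW_category": rng.randrange(7), "a": rng.uniform(5.0, 20.0)}
    return generate_row
```

Rejection sampling guarantees that every returned row satisfies the requested condition, at the cost of discarding non-matching samples.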
Model evaluation
We made the trained OCGAN randomly generate a dataset containing the same amount of crystal structures as in the test set. The similarity between the two datasets served as evaluation metrics, which included the distribution of individual data columns, the overall data visualization by principal-component analysis (PCA) dimensionality reduction, the means and standard deviations of continuous columns, and the internal correlations. Furthermore, the performance of OCGAN is also reflected in the overall performance evaluation of DeepCSP.
As shown in Figure 3A, the real and generated data share similar normal distributions in lattice lengths (L_a, L_b, and L_c). The lattice angles (α_90, β_90, and γ_90) show a similar binary category ratio for the real and generated data, shown as the true-to-false ratio for 90° in the Boolean bar charts. In addition, the α, β, and γ subplots show the distributions of specific angle values, which capture not only the concentration at 90° but also the probability density distributions of the non-90° data. Although some generated samples fall in the tail for β, the distributions of lattice angles were well learned by the generative model. The real and generated data also share similar categorical distributions in the other discrete variables (e.g., z value, space group number, molecular weight_category, S_L_category, S_category). These distribution plots indicate the good learning ability of the generative model.
Figure 3.
Performance evaluation of OCGAN
(A) Comparison of the distribution of the real data with the generated data. L_a, L_b, L_c: lattice length; α_90, β_90, γ_90: distribution of 90° angle and non-90° angle data; α, β, γ: lattice angle; z value: number of molecules in each cell; molecular weight_category: discretization category of molecular mass data; S_L_category: discretization category of molecular short-edge to long-edge ratio; M_L_category: discretization category of molecular medium-long-edge to long-edge ratio; S_M_category: discretization category of molecular short-edge to medium-long-edge ratio; S_category: discretization category of molecular short edge; globularity_category: discretization category of surface area of spheres with the same volume and molecular area; FrTPSA_category: discretization category of TPSA/SASA; dipole moment_category: discretization category of dipole moment; space group number: a description of the crystal symmetry group in 3D space.
(B) Comparison of the mean and SD of the continuous columns of values between the real data and the generated data.
(C) Visualization of the distribution of the real data and the generated data by PCA dimensionality reduction.
(D) Internal correlation matrix of the real data (left) and the generated data (center) and the difference in correlation between them (right).
Figure 3B demonstrated the similar values of means and standard deviations of the generated and the real data. The original high-dimensional data were transformed into visualized 2D scatterplots by PCA dimensionality reduction (Figure 3C) and into 2D correlation matrix heatmaps (Figure 3D). The scatterplot comparison revealed that the generated data effectively captured the distribution of real data, extending slightly beyond it. This indicates that the generative model not only learned the data distribution but also explored unobserved regions. The 2D correlation matrix heatmaps indicate that the generative model successfully learned the internal correlations.
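Two of the similarity checks described above, the per-column mean and standard deviation and the difference in internal correlations, can be sketched in plain Python. This is an illustrative sketch rather than the evaluation code used in the study; rows are assumed to be dictionaries of numeric columns, and the function names are hypothetical.

```python
from statistics import mean, pstdev

def column_summary(rows, columns):
    """Mean and (population) standard deviation of each continuous column."""
    return {c: (mean([r[c] for r in rows]), pstdev([r[c] for r in rows]))
            for c in columns}

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def correlation_gap(real, generated, columns):
    """Absolute difference between real and generated correlations per column pair."""
    gaps = {}
    for i, ci in enumerate(columns):
        for cj in columns[i + 1:]:
            r_real = pearson([r[ci] for r in real], [r[cj] for r in real])
            r_gen = pearson([r[ci] for r in generated], [r[cj] for r in generated])
            gaps[(ci, cj)] = abs(r_real - r_gen)
    return gaps
```

Small gaps across all column pairs correspond to a near-blank difference heatmap of the kind shown on the right of Figure 3D.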
MolGAT
Figure 1F shows the structure of MolGAT for predicting the density of a molecule's stable crystal structures. Local structural features were captured by convolutional operations and then passed to the attention layer to obtain the global structural dependencies, so the model considered both global and local structural information. Appropriate molecular representations were thus generated to predict stable crystal structure densities accurately and directly from 2D molecular structures. Polymorphic compounds in the whole dataset were identified by their International Chemical Identifier (InChI), which was converted from structure-data files (SDF) using the RDKit toolkit. Polymorphs account for only approximately 3% in the CSD, and the densities of different crystalline forms of the same compound are relatively close, with a mean coefficient of variation of approximately 1%. Therefore, the mean of the crystal structure densities was used for polymorphic compounds in the modeling.
Molecular representation
In graph convolutional networks (GCNs), each molecule is represented as an undirected graph. To characterize the local chemical environment, atoms and bonds were encoded as structural features, in which the attribute information was retained. The relative contributions of these features will be learned by MolGAT. An atom is described as a 38-dimensional feature vector and a bond as a 6-dimensional feature vector, respectively. All of the atomic features and bond features are obtained by using the RDKit toolkit, and specific information is shown in Table S1.
Graph convolutional layer
GCNs are able to automatically learn task-specific representations using graph convolution. The graph convolutional layer recurrently passes information about neighboring atoms and bonds to the central atom by circular convolution, aggregating it into an embedding of the surrounding chemical environment. The learned embeddings can be used to predict molecular properties.
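The neighbor aggregation at the heart of a graph convolutional layer can be illustrated with a deliberately simplified sketch. It omits the learned weight matrices, bond features, and nonlinearities that a real layer such as MolGAT's would include, and shows only how information flows from neighbors to the central atom.

```python
def message_passing_step(atom_features, bonds):
    """One round of neighbor aggregation on an undirected molecular graph.
    Each atom's new vector is its own features plus the sum of its neighbors'."""
    neighbors = {i: [] for i in range(len(atom_features))}
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)

    updated = []
    for i, feats in enumerate(atom_features):
        agg = list(feats)
        for j in neighbors[i]:
            agg = [a + b for a, b in zip(agg, atom_features[j])]
        updated.append(agg)
    return updated
```

Applying the step repeatedly widens the chemical environment each atom "sees" by one bond per round.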
Attention mechanism
The attention mechanism applies an attention layer to capture the different importance of fragments in determining crystal density. The attention mechanism uses the weighted sum instead of a simple global pooling step or calculating the sum. Here, the scaled dot-product additive attention was applied.
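The weighted-sum readout can be sketched as follows. This is a minimal illustration that assumes plain scaled dot-product scoring against a single query vector of the same dimension as the atom embeddings; the `query` vector is hypothetical, and the exact scoring function used in MolGAT is not specified in this section.

```python
import math

def attention_readout(atom_embeddings, query):
    """Pool atom embeddings into one molecular vector using softmax attention.
    Scores are scaled dot products between each embedding and the query."""
    d = len(query)
    scores = [sum(q * h for q, h in zip(query, emb)) / math.sqrt(d)
              for emb in atom_embeddings]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = [sum(w * emb[k] for w, emb in zip(weights, atom_embeddings))
              for k in range(d)]
    return pooled, weights
```

With equal scores the readout reduces to mean pooling; unequal scores let the model emphasize the fragments that matter most for density.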
Model performance
According to the common metrics in machine learning, the mean absolute error (MAE), the mean square error (MSE), the root mean square error (RMSE), and the coefficient of determination (R2) were applied.
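For reference, the four metrics can be computed as in the following sketch, which uses their standard definitions. The function name is hypothetical.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE, and R2 for paired true and predicted values."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}
```

Note that RMSE is the square root of MSE, so the two carry the same information in different units (g/cm3 versus its square).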
The performance of MolGAT is shown in Table 2. In Table 2, MolGAT demonstrates a high level of predictive performance on both the validation and test sets. On the final test set, all three error evaluation metrics (MAE <0.04, MSE <0.003, RMSE <0.05) fall within a low error range. The R2 value reaches 0.917 on the test set, indicating a high degree of proximity between predicted and actual values. These results demonstrate the effective modeling and accurate prediction of stable crystal structure densities for organic compounds by MolGAT.
Table 2.
Performance of MolGAT
| Metrics | Training | Validation | Test |
|---|---|---|---|
| MAE (g/cm3) | 0.0336 | 0.0342 | 0.0365 |
| MSE (g/cm3)2 | 0.0019 | 0.0020 | 0.0023 |
| RMSE (g/cm3) | 0.0436 | 0.0447 | 0.0480 |
| R2 | 0.934 | 0.932 | 0.917 |
DeepCSP framework
DeepCSP is built on the coupling of OCGAN and MolGAT. In general, CSP based on machine learning is a process of continuously filtering and reducing the vector space of crystal structures (Figure 1E). The vector space is initially spanned by all of the crystal structure parameters and is reduced to a trial crystal structure space generated under the constraints of molecular mass and other molecular features. Subsequently, candidate crystal structures are selected based on the predicted density of stable crystals and are ultimately ranked and recommended by density.
Crystal structure generation
First, the selected feature descriptors, including molecular weight and molecular features, of the input compounds were calculated and discretized. These descriptors were used as constraints, and the trained GAN model was invoked to generate the trial crystal structures conditionally. Necessary postprocessing was then performed to avoid generating data that violate the crystalline rules. Specific procedures are described in the supplemental information.
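Second, the density of each generated crystal structure was calculated. This study mainly focused on 11 of the 230 space groups, which cover 97% of the crystal structures in the CSD. These 11 space groups belong to 3 crystal systems, and the cell volume (Å3) and density (g/cm3) of the generated crystal structure were calculated by the following equations, written here in the general triclinic form, which reduces to simpler expressions when one or more lattice angles equal 90°:
$$V = abc\sqrt{1 - \cos^2\alpha - \cos^2\beta - \cos^2\gamma + 2\cos\alpha\cos\beta\cos\gamma}$$
$$\rho = \frac{nM}{N_a V \times 10^{-24}}$$
where V is the cell volume; a, b, c are the lattice lengths; α, β, γ are the lattice angles; ρ is the density; n is the number of molecules in the cell; M is the molecular weight; Na is the Avogadro constant; and the factor 10^-24 converts Å3 to cm3.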
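The volume and density calculation can be sketched as below, using the standard general (triclinic) cell volume expression, which also covers cells with right angles. The function names are hypothetical, and this is an illustration rather than the authors' code.

```python
import math

AVOGADRO = 6.02214076e23  # Avogadro constant, mol^-1

def cell_volume(a, b, c, alpha, beta, gamma):
    """Unit-cell volume in cubic angstroms from lengths (angstroms) and angles (degrees)."""
    ca, cb, cg = (math.cos(math.radians(x)) for x in (alpha, beta, gamma))
    return a * b * c * math.sqrt(1.0 - ca**2 - cb**2 - cg**2 + 2.0 * ca * cb * cg)

def crystal_density(n_molecules, mol_weight, volume_a3):
    """Density in g/cm^3 from molecules per cell, molecular weight (g/mol),
    and cell volume in cubic angstroms (1 A^3 = 1e-24 cm^3)."""
    return n_molecules * mol_weight / (AVOGADRO * volume_a3 * 1e-24)
```

For example, a cell with a = b = c = 10 Å and all angles at 90° has a volume of 1,000 Å3; with 4 molecules of molecular weight 150 g/mol, the density is close to 1.0 g/cm3.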
Third, the trained MolGAT was invoked. The desired stable crystal structure density was obtained for the input compounds and named as the MolGAT predicted density.
Fourth, the difference between the MolGAT predicted density and the calculated density of the generated crystal structure, Δdensity, was evaluated and described by the following equation:
$$\Delta\text{density} = \left|\rho_{\text{MolGAT}} - \rho_{\text{calc}}\right|$$
where ρ_MolGAT is the MolGAT predicted density and ρ_calc is the calculated density of the generated crystal structure. The generated crystal structures were screened by density: those with a Δdensity less than 0.05 g/cm3 were retained, which is the optimal threshold determined by a single-factor screening experiment, as shown in Figure 2F.
Fifth, similar crystal structures were clustered: during the crystal structure generation process, a real-time analysis was performed to determine whether a newly generated crystal structure had been produced previously. Specific procedures are described in the supplemental information.
Following the procedures above, 1,000 trial crystal structures were generated for each input compound.
Crystal structure ranking
Since MolGAT has been trained on the density data of realistic stable crystal structures, the predicted crystal structure density for input compounds is considered to be the mathematical expectation of its most stable crystal structure density in reality. The 1,000 trial crystal structures generated were ranked in ascending order by Δdensity, and a density ranking map was obtained for crystal structure recommendation.
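The screening and ranking steps can be sketched together as follows. The sketch assumes Δdensity is the absolute difference between the predicted and calculated densities, consistent with its description as a distance, and uses the 0.05 g/cm3 threshold reported above. The function name and the dictionary layout of candidates are hypothetical.

```python
def screen_and_rank(candidates, predicted_density, threshold=0.05):
    """Keep candidates whose calculated density lies within `threshold` g/cm^3
    of the predicted density, then sort by ascending density difference."""
    scored = []
    for cand in candidates:
        # Assumption: delta is the absolute difference of the two densities.
        delta = abs(cand["density"] - predicted_density)
        if delta < threshold:
            scored.append({**cand, "delta_density": delta})
    return sorted(scored, key=lambda c: c["delta_density"])
```

The first element of the returned list is the candidate whose calculated density is closest to the MolGAT prediction, that is, the top-ranked recommendation.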
Marketed drugs validation
We used the crystal structures of marketed drugs as an additional validation dataset. Three metrics were introduced to evaluate predictive performance: the accuracy, the hit rate, and the top rank of the hit structures. Accuracy and hit rate were defined by the following equations:
$$\text{Accuracy} = \frac{\text{number of right prediction drugs}}{\text{total number of drugs}} \times 100\%$$
where right prediction drugs are drugs whose known structures were all included in the predicted structures, and
$$\text{Hit rate} = \frac{\text{number of structure hit drugs}}{\text{total number of drugs}} \times 100\%$$
where structure hit drugs are drugs for which at least one known structure was included in the predicted structures.
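A minimal sketch of the two metrics is given below. It assumes that a "right prediction" requires all known structures of a drug to appear among the predicted structures, whereas a "structure hit" requires at least one, and that both counts are divided by the total number of drugs. The function name and input layout are hypothetical.

```python
def accuracy_and_hit_rate(results):
    """`results` maps each drug to (known_structures, matched_structures) as sets.
    Returns (accuracy %, hit rate %) over all drugs."""
    n = len(results)
    # Assumption: right prediction = every known structure was matched.
    right = sum(1 for known, matched in results.values() if known and known <= matched)
    # Structure hit = at least one known structure was matched.
    hit = sum(1 for known, matched in results.values() if known & matched)
    return 100.0 * right / n, 100.0 * hit / n
```

Under these definitions the two metrics coincide for monomorphic drugs, which have a single known structure, and can differ only for polymorphic drugs.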
The accuracy and hit rate reflect the ability of DeepCSP to detect potentially stable crystal structures. The rank of the hit structures reflects how complete the current crystallographic study of a given compound is. Qualitatively, the higher a crystal structure ranks, the larger the probability that it is the most thermodynamically stable structure. To semiquantitatively characterize the meaning of the ranking, we collated the ranking distribution of the top-ranked hit structures for marketed drugs, as shown in Figure 4A. Given the finding that approximately 15%–45% of marketed drugs have more stable crystal forms that remain unidentified,6 the ranks at which the complementary cumulative distribution function of the marketed-drug rank distribution reaches these proportions are 565 and 227, respectively. On this basis, the crystallographic study completeness of a given compound can be tentatively estimated from the rank of its top-ranked hit structure: a rank in the top 227 indicates relatively high completeness, a rank between 227 and 565 suggests moderate completeness, and a rank after 565 can be considered relatively low completeness. Specific procedures are described in the supplemental information.
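The tentative completeness estimate can be written as a simple lookup, sketched below. The treatment of ranks exactly equal to 227 or 565 is an assumption, since the text does not specify how the boundaries are assigned.

```python
def completeness_tier(top_hit_rank, high_cutoff=227, low_cutoff=565):
    """Map the rank of the top-ranked hit structure to a completeness tier."""
    if top_hit_rank <= high_cutoff:   # assumption: the boundary rank counts as high
        return "relatively high"
    if top_hit_rank <= low_cutoff:    # assumption: the boundary rank counts as moderate
        return "moderate"
    return "relatively low"
```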
Figure 4.
Detailed assessment of crystal structure ranking and density prediction for marketed drugs
(A) Ranking distribution of the top-ranked crystal structures of marketed drugs; the orange line is the curve of complementary cumulative distribution function.
(B) Density prediction error (the predicted value minus the real value) of MolGAT on the marketed drug dataset.
The validation results for the marketed drug dataset are shown in Table 3. Overall, DeepCSP possesses an accuracy and hit rate of over 80% on the marketed drug dataset, with an average rank of 274.
Table 3.
Performance of DeepCSP with different numbers of candidate structures generated
| Drug | Accuracy, % (1,000) | Hit rate, % (1,000) | Averaged rank (1,000) | Accuracy, % (500) | Hit rate, % (500) | Averaged rank (500) | Accuracy, % (300) | Hit rate, % (300) | Averaged rank (300) | Accuracy, % (100) | Hit rate, % (100) | Averaged rank (100) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Monomorph | 84.85 | 84.85 | 260 | 70.45 | 70.45 | 186 | 53.79 | 53.79 | 124 | 23.48 | 23.48 | 48 |
| Polymorph | 75.37 | 91.79 | 296 | 58.96 | 84.33 | 185 | 35.82 | 66.42 | 126 | 15.67 | 41.79 | 43 |
| Total | 81.29 | 87.46 | 274 | 66.13 | 75.67 | 186 | 47.04 | 58.54 | 125 | 20.54 | 30.36 | 46 |
We evaluated how the number of candidate crystal structures affects the performance of DeepCSP by narrowing the screening scope to 500, 300, and 100 candidates (Table 3). As the number of candidates decreases, both the accuracy and the hit rate decline. When the number of candidate crystal structures was halved to 500, the overall accuracy and hit rate still remained above 60% and 70%, respectively. This robustness enhances DeepCSP’s practical applicability and reference value for crystal scientists in real-world scenarios. In addition, the prediction can be completed within just 1 core-hour, greatly boosting research and development efficiency.
For the case study involving the sixth blind test target compounds (XXII–XXVI) and the 5-methyl-2-[(2-nitrophenyl)amino]-3-thiophenecarbonitrile (ROY) organic compound with 15 different crystalline forms, please refer to the supplemental information.
Discussion
The present study reports, for the first time, a pure machine learning framework for predicting the crystal structures of organic compounds, in which a generative model works in tandem with a discriminative module. As shown in Figure 5, DeepCSP follows the path of first principles–based CSP while demonstrating distinctive advantages. Compared with ab initio CSP methods, the proposed DeepCSP framework has at least four advantages: much faster computing speed, more powerful sampling capability, better realistic applicability, and stronger generalizability.
Figure 5.
Comparison of ab initio CSPs and the proposed machine learning–driven CSP
The prediction speed of the pure machine learning framework far exceeds that of ab initio CSP methods. The computing time of ab initio CSP methods is estimated to exceed 10^7 s for each given molecule, depending on the energy calculation accuracy and hardware conditions. In the sixth blind test,36 25 participating teams completed 84 submissions, spending a total of over 40 million central processing unit (CPU)-hours, averaging over 476,900 CPU-hours per submission. By comparison, DeepCSP completes the same tasks on a laptop computer in less than 1 CPU-hour, reducing the computing cost by a factor of more than tens of thousands.
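A quick back-of-the-envelope check puts these figures on a common scale. It reads the quoted ab initio estimate as 10^7 s (an assumption about the exponent) and uses the quoted average per submission and the 1 CPU-hour bound for DeepCSP. The variable names are illustrative, and CPU-hours on different hardware are only roughly comparable.

```python
SECONDS_PER_HOUR = 3600

# Lower bound quoted for an ab initio prediction of one molecule, in hours.
ab_initio_hours = 1e7 / SECONDS_PER_HOUR

# Average cost per blind-test submission, as quoted in the text (CPU-hours).
blind_test_average_cpu_hours = 476_900

# Upper bound quoted for DeepCSP (CPU-hours).
deepcsp_cpu_hours = 1.0

# Lower bound on the cost ratio between an average submission and DeepCSP.
cost_ratio = blind_test_average_cpu_hours / deepcsp_cpu_hours
```

On these numbers the ab initio estimate corresponds to roughly 2,800 hours per molecule, and the cost ratio is on the order of 10^5, consistent with the claim of a reduction by tens of thousands of times or more.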
Generating candidate crystal structures is the starting point for lattice energy minimization. In ab initio CSPs, there are several types of methods for generating sets of trial crystal packings. The commonly used generation methods include grid search, random search, quasi-random search, and their variants. Again, taking the sixth blind test as a reference,36 15/25 of the submissions adopted the relatively simple random search methods. However, random searching that relies completely on random numbers usually requires large-scale sampling to achieve a complete exploration of the phase space. Quasi-random searches that consider factors such as the target volume parameter37 incorporate a simple mapping of random numbers to structural parameters, yet without learning the complex continuous crystal structure space. Alternative crystal structure generation methods include global optimization algorithms such as Monte Carlo search,38 simulated annealing,39 and evolutionary or genetic algorithms.40,41 These methods have been used to address the challenge of CSP for large and flexible molecules,42,43,44 with the attractive feature of finding local minima and often demonstrating robust performance.12 However, the effective treatment of the structure, energy, and properties between the molecules, as well as the internal potentials of the crystal structures, is also achieved by QM calculations, which requires a further investment of computing resources. In addition, some approaches match target molecules with molecules of known experimental crystal structures in the CSD to generate similar crystal structures;36 their limitation is the inability to escape from existing templates and explore broader potential molecular packings. The crystal structures explored so far (approximately 1 million) are still dwarfed by the complete crystal structure space.
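The difference between random and quasi-random sampling of structural parameters can be illustrated with a short sketch. It uses a Halton sequence as a generic low-discrepancy generator, which is not necessarily the sequence used in the cited quasi-random work, and maps points of the unit cube linearly to cell lengths. The 3–30 Å range and all function names are hypothetical choices for illustration.

```python
import random

def van_der_corput(index, base):
    """Radical inverse of a positive integer index in the given base, in [0, 1)."""
    value, denom = 0.0, 1.0
    while index > 0:
        index, digit = divmod(index, base)
        denom *= base
        value += digit / denom
    return value

def halton(n_points, bases=(2, 3, 5)):
    """First n_points of a Halton sequence, one coordinate per base."""
    return [tuple(van_der_corput(i, b) for b in bases)
            for i in range(1, n_points + 1)]

def random_points(n_points, dim=3, seed=0):
    """Pseudo-random points in the unit cube, for comparison."""
    rng = random.Random(seed)
    return [tuple(rng.random() for _ in range(dim)) for _ in range(n_points)]

def to_cell_lengths(unit_point, low=3.0, high=30.0):
    """Map a unit-cube point to cell lengths in angstroms (hypothetical range)."""
    return tuple(low + u * (high - low) for u in unit_point)
```

Calling `to_cell_lengths` on each Halton point yields trial cell lengths that fill the parameter range more evenly than the same number of pseudo-random draws. Neither scheme learns anything about where realistic structures lie, which is the gap a generative model aims to close.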
The pure machine learning approach we present here can sample from the learned continuous crystal structure space with a low computing cost. Furthermore, the generative models possess the ability to explore uninvolved crystal structure spaces and internalize molecular property conditions in the generation process to explore the target structure space in a directed manner.
Following the generation of trial crystal structures, CSPs optimize or minimize the initial packing and then rank the structures by stability or likelihood of occurrence, using some form of energy-based metric. The relative thermodynamic stability of a polymorph is determined by its free energy, which consists of the contributions of both zero-point and thermal motions to the lattice enthalpy and entropy.36 In 1995, Gavezzotti and Filippini45 proposed that enthalpy was more important than entropy in determining the stability of polymorphs. Exploring crystal structure stability under given conditions of temperature and pressure requires more expensive quantum dynamics calculations, such as molecular dynamics simulations with enhanced sampling,46 the Einstein crystal method,47 and so forth. In contrast, in the proposed pure machine learning–based approach, the models are trained on real crystal structure and density data at room temperature and pressure. The generative model learns the vector space spanned by the actual crystal structure vectors, and the graph network learns the mapping from the molecular structure to the crystal density at room temperature and pressure. Such “direct learning” makes our method more practical.
Through its development, ab initio CSP has branched into several directions. These branches differ in computational accuracy and applicability, with variations that can be summarized as differences in the pseudopotential and the approximate exchange-correlation functional.48,49,50 The type of force field chosen also varies with the type of molecule and requires the consideration of a variety of factors, such as molecular flexibility, short-range or long-range interactions, and so on.51 Although choosing the appropriate computational method for a specific case improves the prediction accuracy to some extent, it also raises the threshold for applying ab initio CSP methods for scientists in other fields. The pure machine learning framework trained on various organic compounds provides a unified computational framework as a general-purpose approach, making it user-friendly for non-CSP experts.
Inspired by the generation-ranking concept of CSP, the present study demonstrates the potential of pure machine learning CSP for organic compounds, using the stable crystal structure density, which is more feasible to predict at this stage, as the indicator for screening and ranking trial structures. However, limitations remain, the most crucial being that the accuracy of the current stable crystal structure density predictions is not yet comparable to that of lattice energy calculations. In the 1960s, McCrone52 made the famous assertion about polymorphism: “For given compounds, the number of crystalline forms discovered is proportionate to the amount of time and money spent.” However, fewer than 3% of the compounds in the CSD are polymorphic, which suggests that current organic crystal structure studies are insufficient and that the CSD may contain a large number of metastable crystalline forms whose densities, according to the density rules, are smaller than those of the stable crystal structures. It is essential to acknowledge that, because of this limitation in data quality, the densities predicted by MolGAT will also be slightly lower than the actual densities of the stable crystal structures. As outlined, the crystal structures of marketed drugs are recognized as relatively well investigated, which implies that the predicted densities of marketed drugs should be smaller than the actual values. An additional performance evaluation of MolGAT was carried out using the marketed drug dataset, and the results are shown in Figure 4B: the MolGAT-predicted densities for marketed drugs are on the small side, in agreement with our hypothesis. This brief verification supports at least two points: first, the reliability of the density rules, and second, that the quality of the current stable crystal structure density dataset requires further improvement.
Therefore, improving the prediction accuracy of stable crystal structure densities is an indispensable direction for improving the performance of DeepCSP in trial crystal structure screening and ranking. Of course, our proposed density-based pure machine learning organic CSP framework can also be quickly converted to lattice energy or free energy predictions for fast and more accurate CSP when accurate crystal energy data at specific temperature and pressure are available. In addition, extending the current version of DeepCSP to all types of organic compounds is part of our research plan.
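The check behind Figure 4B amounts to inspecting the sign of the prediction error, defined in the figure caption as the predicted value minus the real value. A minimal sketch, with a hypothetical function name, is:

```python
def signed_error_summary(predicted, actual):
    """Mean signed error (predicted minus actual) and fraction of underpredictions."""
    errors = [p - a for p, a in zip(predicted, actual)]
    mean_error = sum(errors) / len(errors)
    under_fraction = sum(e < 0 for e in errors) / len(errors)
    return mean_error, under_fraction
```

A negative mean error together with an underprediction fraction above one half would correspond to predicted densities being on the small side, as described above. Any densities passed to this helper for demonstration are illustrative inputs, not values from the paper.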
Conclusion
This study has developed a novel pure machine learning framework called DeepCSP to achieve minute-level rapid CSP for organic compounds. Based on 177,746 crystal structure data entries, a CGAN model was implemented to address the trial crystal structure generation task in CSP. The embedding of prior knowledge has improved the quality of the generative network. Concurrently, a GCN model was constructed to predict the stable crystal structure density of the input compound, and an attention strategy was incorporated to improve the model performance. Next, the predicted density was used as the basis for the screening and ranking of trial crystal structures, which enabled the rapid prediction. Altogether, the proposed pure machine learning framework combines pioneering ideas, effective technical strategies, and advanced artificial intelligence algorithms to accelerate CSP, while retaining the ability to explore potentially stable organic crystalline forms. The current framework holds the potential to serve as a valuable tool in guiding crystal engineering efforts and to lead the way to a new era in machine learning–driven CSP. Looking ahead, with deeper integration into CSP mindsets, DeepCSP is poised to deliver more reliable stacking information and enhanced generalization capabilities.
Materials and methods
See the supplemental information for details.
Acknowledgments
We are thankful for the funding provided by University of Macau Research Grant (MYRG-CRG2022-00008-ICMS), Shenzhen-Hong Kong-Macau Science and Technology Program (Category C) of Shenzhen Science and Technology Innovation Commission (SGDX20210823103802016), and industry-university-research cooperation project and Zhuhai-Hong Kong-Macao cooperation project from Zhuhai Science and Technology Innovation Bureau (ZH22017002210010PWC). This study was partially performed at Super Intelligent Computing Center, which is supported by Internet of Things for Smart City of the University of Macau. We thank the Macao Polytechnic University for the financial support of the CSD database.
Author contributions
Z.Y. and D.O. planned the study and drafted the manuscript. Z.Y., N.W., and D.O. contributed to data analysis and interpretation. Z.Y. designed the models and performed the experiments. Z.Y., N.W., J.Z., and D.O. participated in the manuscript revision. All of the authors have given final approval for the manuscript to be published and have agreed to be responsible for all of the aspects of the manuscript.
Declaration of interests
Z.Y., N.W., and D.O. are applying for a patent on the DeepCSP method. J.Z. declares no competing interests.
Published Online: January 8, 2024
Footnotes
It can be found online at https://doi.org/10.1016/j.xinn.2023.100562.
Lead contact website
Defang Ouyang: https://sklqrcm.um.edu.mo/de-fang-ouyang/
Zhuyifan Ye: https://sites.google.com/view/yezhuyifan/
Supplemental information
References
- 1. Price S.L. Predicting crystal structures of organic compounds. Chem. Soc. Rev. 2014;43:2098–2111. doi: 10.1039/C3CS60279F.
- 2. Hilfiker R., Von Raumer M. Polymorphism in the Pharmaceutical Industry: Solid Form and Drug Development. John Wiley & Sons; 2019.
- 3. Wöhler, Liebig. Untersuchungen über das Radikal der Benzoesäure. Ann. Pharm. 1832;3:249–282. doi: 10.1002/jlac.18320030302.
- 4. Brog J.-P., Chanez C.-L., Crochet A., et al. Polymorphism, what it is and how to identify it: a systematic review. RSC Adv. 2013;3:16905–16931. doi: 10.1039/c3ra41559g.
- 5. Kersten K., Kaur R., Matzger A. Survey and analysis of crystal polymorphism in organic structures. IUCrJ. 2018;5:124–129. doi: 10.1107/S2052252518000660.
- 6. Neumann M.A., van de Streek J. How many ritonavir cases are there still out there? Faraday Discuss. 2018;211:441–458. doi: 10.1039/C8FD00069G.
- 7. Bauer J., Spanton S., Henry R., et al. Ritonavir: An extraordinary example of conformational polymorphism. Pharm. Res. (N. Y.) 2001;18:859–866. doi: 10.1023/A:1011052932607.
- 8. Bernstein J. Polymorphism in molecular crystals 2e. Oxford University Press; 2020.
- 9. Taylor C.R., Mulvee M.T., Perenyi D.S., et al. Minimizing polymorphic risk through cooperative computational and experimental exploration. J. Am. Chem. Soc. 2020;142:16668–16680. doi: 10.1021/jacs.0c06749.
- 10. Liu F., Chen K., Xue D. How to fast grow large-size crystals? Innovation. 2023;4. doi: 10.1016/j.xinn.2023.100458.
- 11. Bučar D.K., Lancaster R.W., Bernstein J. Disappearing polymorphs revisited. Angew. Chem. Int. Ed. 2015;54:6972–6993. doi: 10.1002/anie.201410356.
- 12. Neumann M.A., Van De Streek J., Fabbiani F.P.A., et al. Combined crystal structure prediction and high-pressure crystallization in rational pharmaceutical polymorph screening. Nat. Commun. 2015;6:7793–7797. doi: 10.1038/ncomms8793.
- 13. Magrasó A., Frontera C., Marrero-López D., et al. New crystal structure and characterization of lanthanum tungstate “La6WO12” prepared by freeze-drying synthesis. Dalton Trans. 2009:10273–10283. doi: 10.1039/B916981B.
- 14. Childs S.L., Chyall L.J., Dunlap J.T., et al. A metastable polymorph of metformin hydrochloride: isolation and characterization using capillary crystallization and thermal microscopy techniques. Cryst. Growth Des. 2004;4:441–449. doi: 10.1021/cg034243p.
- 15. Zaccaro J., Matic J., Myerson A.S., et al. Nonphotochemical, laser-induced nucleation of supersaturated aqueous glycine produces unexpected γ-polymorph. Cryst. Growth Des. 2001;1:5–8. doi: 10.1021/cg0055171.
- 16. Potticary J., Terry L.R., Bell C., et al. An unforeseen polymorph of coronene by the application of magnetic fields during crystal growth. Nat. Commun. 2016;7:11555–11557. doi: 10.1038/ncomms11555.
- 17. Reverchon E. Supercritical antisolvent precipitation of micro-and nano-particles. J. Supercrit. Fluids. 1999;15:1–21. doi: 10.1016/S0896-8446(98)00129-6.
- 18. Price S.L. From crystal structure prediction to polymorph prediction: interpreting the crystal energy landscape. Phys. Chem. Chem. Phys. 2008;10:1996–2009. doi: 10.1039/b719351c.
- 19. Han Y., Ali I., Wang Z., et al. Machine learning accelerates quantum mechanics predictions of molecular crystals. Phys. Rep. 2021;934:1–71. doi: 10.1016/j.physrep.2021.08.002.
- 20. Wang W., Ye Z., Gao H., et al. Computational pharmaceutics-A new paradigm of drug delivery. J. Contr. Release. 2021;338:119–136. doi: 10.1016/j.jconrel.2021.08.030.
- 21. Xu Y., Liu X., Cao X., et al. Artificial intelligence: A powerful paradigm for scientific research. Innovation. 2021;2. doi: 10.1016/j.xinn.2021.100179.
- 22. Xu Y., Wang F., An Z., et al. Artificial intelligence for science—bridging data to wisdom. Innovation. 2023;4. doi: 10.1016/j.xinn.2023.100525.
- 23. Taylor R., Wood P.A. A million crystal structures: The whole is greater than the sum of its parts. Chem. Rev. 2019;119:9427–9477. doi: 10.1021/acs.chemrev.9b00155.
- 24. Long T., Fortunato N.M., et al. Constrained crystals deep convolutional generative adversarial network for the inverse design of crystal structures. npj Comput. Mater. 2021;7:66–67. doi: 10.1038/s41524-021-00526-4.
- 25. Noh J., Kim J., Stein H.S., et al. Inverse design of solid-state materials via a continuous representation. Matter. 2019;1:1370–1384. doi: 10.1016/j.matt.2019.08.017.
- 26. Cheng G., Gong X.-G., Yin W.-J. Crystal structure prediction by combining graph network and optimization algorithm. Nat. Commun. 2022;13:1492–1498. doi: 10.1038/s41467-022-29241-4.
- 27. Kim S., Noh J., Gu G.H., et al. Generative adversarial networks for crystal structure prediction. ACS Cent. Sci. 2020;6:1412–1420. doi: 10.1021/acscentsci.0c00426.
- 28. Radford A., Metz L., Chintala S. Unsupervised representation learning with deep convolutional generative adversarial networks. Preprint at arXiv. 2015. doi: 10.48550/arXiv.1511.06434.
- 29. Burger A., Ramberger R. On the polymorphism of pharmaceuticals and other molecular crystals. Mikrochim. Acta. 1979;72:259–271. doi: 10.1007/BF01197379.
- 30. Burger A., Ramberger R. On the polymorphism of pharmaceuticals and other molecular crystals. II. Mikrochim. Acta. 1979;72:273–316. doi: 10.1007/BF01197380.
- 31. Bruno I.J., Cole J.C., Edgington P.R., et al. New software for searching the Cambridge Structural Database and visualizing crystal structures. Acta Crystallogr. B. 2002;58:389–397. doi: 10.1107/S0108768102003324.
- 32. Weininger D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Comput. Sci. 1988;28:31–36. doi: 10.1021/ci00057a005.
- 33. Landrum G. RDKit: Open-source cheminformatics. 2006.
- 34. O'Boyle N.M., Banck M., James C.A., et al. Open Babel: An open chemical toolbox. J. Cheminf. 2011;3:1–14. doi: 10.1186/1758-2946-3-33.
- 35. Fábián L. Cambridge structural database analysis of molecular complementarity in cocrystals. Cryst. Growth Des. 2009;9:1436–1443. doi: 10.1021/cg800861m.
- 36. Reilly A.M., Cooper R.I., Adjiman C.S., et al. Report on the sixth blind test of organic crystal structure prediction methods. Acta Crystallogr. B. 2016;72:439–459. doi: 10.1107/S2052520616007447.
- 37. Case D.H., Campbell J.E., Bygrave P.J., et al. Convergence properties of crystal structure prediction by quasi-random sampling. J. Chem. Theor. Comput. 2016;12:910–924. doi: 10.1021/acs.jctc.5b01112.
- 38. Pillardy J., Arnautova Y.A., Czaplewski C., et al. Conformation-family Monte Carlo: A new method for crystal structure prediction. Proc. Natl. Acad. Sci. USA. 2001;98:12351–12356. doi: 10.1073/pnas.231479298.
- 39. Karfunkel H.R., Gdanitz R.J. Ab Initio prediction of possible crystal structures for general organic molecules. J. Comput. Chem. 1992;13:1171–1183. doi: 10.1002/jcc.540131002.
- 40. Kim S., Orendt A.M., Ferraro M.B., et al. Crystal structure prediction of flexible molecules using parallel genetic algorithms with a standard force field. J. Comput. Chem. 2009;30:1973–1985. doi: 10.1002/jcc.21189.
- 41. Lund A.M., Pagola G.I., Orendt A.M., et al. Crystal structure prediction from first principles: The crystal structures of glycine. Chem. Phys. Lett. 2015;626:20–24. doi: 10.1016/j.cplett.2015.03.015.
- 42. Pyzer-Knapp E.O., Thompson H.P.G., Schiffmann F., et al. Predicted crystal energy landscapes of porous organic cages. Chem. Sci. 2014;5:2235–2245. doi: 10.1039/c4sc00095a.
- 43. Kendrick J., Stephenson G.A., Neumann M.A., et al. Crystal structure prediction of a flexible molecule of pharmaceutical interest with unusual polymorphic behavior. Cryst. Growth Des. 2013;13:581–589. doi: 10.1021/cg301222m.
- 44. Price S.L., Leslie M., Welch G.W.A., et al. Modelling organic crystal structures using distributed multipole and polarizability-based model intermolecular potentials. Phys. Chem. Chem. Phys. 2010;12:8478–8490. doi: 10.1039/c004164e.
- 45. Gavezzotti A., Filippini G. Polymorphic forms of organic crystals at room conditions: thermodynamic and structural implications. J. Am. Chem. Soc. 1995;117:12299–12305. doi: 10.1021/ja00154a032.
- 46. Yu T.-Q., Tuckerman M.E. Temperature-accelerated method for exploring polymorphism in molecular crystals based on free energy. Phys. Rev. Lett. 2011;107. doi: 10.1103/PhysRevLett.107.015701.
- 47. Frenkel D., Ladd A.J.C. New Monte Carlo method to compute the free energy of arbitrary solids. Application to the fcc and hcp phases of hard spheres. J. Chem. Phys. 1984;81:3188–3193. doi: 10.1063/1.448024.
- 48. Uzoh O.G., Galek P.T.A., Price S.L. Analysis of the conformational profiles of fenamates shows route towards novel, higher accuracy, force-fields for pharmaceuticals. Phys. Chem. Chem. Phys. 2015;17:7936–7948. doi: 10.1039/c4cp05525j.
- 49. Broo A., Nilsson Lill S.O. Transferable force field for crystal structure predictions, investigation of performance and exploration of different rescoring strategies using DFT-D methods. Acta Crystallogr. B. 2016;72:460–476. doi: 10.1107/S2052520616006831.
- 50. Cutini M., Civalleri B., Corno M., et al. Assessment of different quantum mechanical methods for the prediction of structure and cohesive energy of molecular crystals. J. Chem. Theor. Comput. 2016;12:3340–3352. doi: 10.1021/acs.jctc.6b00304.
- 51. Price S.L. Is zeroth order crystal structure prediction (CSP_0) coming to maturity? What should we aim for in an ideal crystal structure prediction code? Faraday Discuss. 2018;211:9–30. doi: 10.1039/C8FD00121A.
- 52. McCrone W. Physics and Chemistry of the Organic Solid State. Wiley-Interscience; 1965.