Journal of Computer Science and Technology. 2022 Nov 30;37(6):1464–1477. doi: 10.1007/s11390-021-0970-3

Towards Exploring Large Molecular Space: An Efficient Chemical Genetic Algorithm

Jian-Fu Zhu 1, Zhong-Kai Hao 1, Qi Liu 1, Yu Yin 1, Cheng-Qiang Lu 1, Zhen-Ya Huang 1, En-Hong Chen 1
PMCID: PMC9797891  PMID: 36594005

Abstract

Generating molecules with desired properties is an important task in chemistry and pharmacy. An efficient generation method could help in finding drugs to treat diseases such as COVID-19, and data mining and artificial intelligence are promising routes to such a method. Recently, both deep-learning-based generative models and genetic-algorithm-based approaches have made progress in generating molecules and optimizing their properties. However, existing methods still fall short in efficiency and performance. To address these problems, we propose a method named the Chemical Genetic Algorithm for Large Molecular Space (CALM). Specifically, CALM employs a scalable and efficient molecular representation called the molecular matrix. We then design corresponding crossover, mutation, and mask operators inspired by domain knowledge and previous studies. We apply our genetic algorithm to several tasks related to molecular property optimization and constrained molecular optimization. The results show that our approach outperforms other state-of-the-art deep learning and genetic algorithm methods; z-tests performed on the results of several experiments indicate, with more than 99% confidence, that the improvement is significant. In addition, based on the experimental results, we point out an insufficiency in the experimental evaluation standard that affects the fair evaluation of previous work.
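To make the general idea of a genetic algorithm over a matrix representation concrete, the following is a minimal toy sketch. It is not the CALM implementation. The matrix layout (a symmetric bond-order matrix), the valence cap, the fitness function, the block-style crossover, the single-bond mutation, and the `frozen` mask are all assumptions introduced here for illustration; the actual molecular matrix and operators are defined in the full paper.

```python
import random

N = 6            # number of atoms (hypothetical fixed size)
MAX_VALENCE = 4  # toy valence cap per atom (assumption)
MAX_BOND = 3     # single, double, or triple bond


def random_matrix(rng):
    """Random symmetric bond-order matrix with a zero diagonal."""
    m = [[0] * N for _ in range(N)]
    for i in range(N):
        for j in range(i + 1, N):
            m[i][j] = m[j][i] = rng.choice([0, 0, 1])
    return m


def fitness(m):
    """Toy objective: total bond order, or 0 if any atom exceeds the valence cap."""
    if any(sum(row) > MAX_VALENCE for row in m):
        return 0
    return sum(m[i][j] for i in range(N) for j in range(i + 1, N))


def crossover(a, b, cut):
    """Child takes bonds among atoms [0, cut) from a and all other bonds from b."""
    return [[a[i][j] if i < cut and j < cut else b[i][j] for j in range(N)]
            for i in range(N)]


def mutate(m, rng, frozen=frozenset()):
    """Set one random bond to a random order; bonds touching frozen atoms are masked."""
    free = [(i, j) for i in range(N) for j in range(i + 1, N)
            if i not in frozen and j not in frozen]
    child = [row[:] for row in m]
    if free:
        i, j = rng.choice(free)
        child[i][j] = child[j][i] = rng.randint(0, MAX_BOND)
    return child


def evolve(pop_size=20, generations=30, seed=0):
    """Elitist loop: keep the fitter half, refill with mutated crossover children."""
    rng = random.Random(seed)
    pop = [random_matrix(rng) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        history.append(fitness(pop[0]))
        survivors = pop[: pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            children.append(mutate(crossover(a, b, rng.randint(1, N - 1)), rng))
        pop = survivors + children
    best = max(pop, key=fitness)
    history.append(fitness(best))
    return best, history
```

Calling `evolve()` returns the best matrix found and the best fitness per generation. Because the fitter half of the population always survives, that history never decreases. A real property-optimization task would replace `fitness` with a chemical score and would need a validity check far stricter than this valence cap.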

Supplementary Information

The online version contains supplementary material available at 10.1007/s11390-021-0970-3.

Keywords: data mining, molecular generation, genetic algorithm, drug discovery, artificial intelligence

Supplementary Information

ESM 1 (PDF, 107.4 KB)

Acknowledgement

The authors would like to thank the reviewers for their valuable comments and Dr. Jan H. Jensen for important corrections.

Contributor Information

Jian-Fu Zhu, Email: jeffzhu@mail.ustc.edu.cn.

Zhong-Kai Hao, Email: hzk171805@mail.ustc.edu.cn.

Qi Liu, Email: qiliuql@ustc.edu.cn.

Yu Yin, Email: yxonic@mail.ustc.edu.cn.

Cheng-Qiang Lu, Email: lunar@mail.ustc.edu.cn.

Zhen-Ya Huang, Email: huangzhy@ustc.edu.cn.

En-Hong Chen, Email: cheneh@ustc.edu.cn.

Articles from Journal of Computer Science and Technology are provided here courtesy of Nature Publishing Group
