Abstract
Natural language processing models have emerged that can generate useable software and automate a number of programming tasks with high fidelity. These tools have yet to have an impact on the chemistry community. Yet, our initial testing demonstrates that this form of artificial intelligence is poised to transform chemistry and chemical engineering research. Here, we review developments that brought us to this point, examine applications in chemistry, and give our perspective on how this may fundamentally alter research and teaching.
In 2021, Chen et al. released a new natural language processing (NLP) model called Codex that can generate code from natural language prompts.1 Interest has been broadly focused on its application to software engineering. We, somewhat sarcastically, asked it to “compute the dissociation curve of H2 using pyscf”2 and the result is shown in Fig. 1. It generated correct code and even plotted the result (see ESI† for further details). Some may scoff at the method (Hartree–Fock) and basis set (STO-3G) selected by the artificial intelligence (AI). Thus, we asked it to “use the most accurate method” as a continuation of our “conversation” and it switched to CCSD in a large basis. AI models that can connect natural language to programming will have significant consequences for the field of chemistry. Here we outline a brief history of these models and our perspective on where these models will take us.
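For readers unfamiliar with PySCF, the kind of script requested by the first prompt can be sketched as follows. This is an illustration only, not the code Codex returned in Fig. 1. It assumes PySCF is installed, and the function names `bond_lengths` and `h2_dissociation_curve`, the scan range, and the number of points are hypothetical choices made for this sketch.

```python
def bond_lengths(start=0.4, stop=3.0, n_points=14):
    """Evenly spaced H-H distances in angstrom (illustrative range)."""
    step = (stop - start) / (n_points - 1)
    return [round(start + i * step, 6) for i in range(n_points)]


def h2_dissociation_curve(distances, basis="sto-3g"):
    """Restricted Hartree-Fock energy (hartree) of H2 at each distance."""
    # Imported lazily so the grid helper above works without PySCF installed.
    from pyscf import gto, scf

    energies = []
    for r in distances:
        # Two hydrogen atoms on the z axis, separated by r angstrom.
        mol = gto.M(atom=f"H 0 0 0; H 0 0 {r}", basis=basis, verbose=0)
        energies.append(scf.RHF(mol).kernel())
    return energies
```

Calling `h2_dissociation_curve(bond_lengths())` should return one energy per distance, which can then be plotted against the distances. Restricted Hartree–Fock is known to describe the stretched-bond limit of H2 poorly, which is one reason the follow-up request for a more accurate method matters.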
Recent developments
There has been a flurry of advances in the topic of “autocomplete” style language models that can generate text given a prompt. These language models are deep neural networks with a specific architecture called transformers.3,4 These models are trained on text that has words hidden,5 and have the task of filling in missing text.4,6,7 This is called “pre-training,” because these models were not intended to fill in missing words, but rather to be used on downstream tasks like classifying sentiment in text or categorizing text.4 Surprisingly, it was found that these models could generate a long, seemingly real passage of text simply from a short initial fragment of text called a prompt.4,8 A prompt can ask the model to answer a question, summarize a story, or make an analogy, all with the same model. This was interesting, especially because the quality was beyond previous text generation methods like recurrent neural networks or hidden Markov models.9 After increasing model size and the training corpus, the next generation of language models was able to answer novel prompts beyond standard question-and-answer or writing summaries.10 For example, given three worked-out examples of extracting compound names from a sentence, the GPT-3 model could do the same for any new sentence. We show the utility of this for parsing chemistry literature in Fig. 2 using text from ref. 11 (see ESI† for full details). This result is remarkable because it requires no additional training, just the input prompt, literally a training size of 3. Not so long ago, this was considered a difficult problem even when using thousands of training examples.12 A caveat to these large language models (LLMs) is that they have a limited understanding of the text which they parse or generate; for example, we find they can generate seemingly valid chemistry text but cannot answer simple questions about well-known chemical trends.
After these new LLMs were developed, anyone could have state-of-the-art performance on language tasks simply by constructing a few examples of their task. In the last few months, even the need for worked-out examples has been removed; in some cases a simple “imperative” sentence is enough.13 For example, a variation on the name of this article was generated by asking an imperative-style model to “write an exciting title” given an earlier version of the abstract. The pace has been nothing short of remarkable, going from the transformer in 2017 to a near universal language model in 2020 to a model which can take instructions in 2021.
The largest and arguably most accurate model in this class is still the GPT-3 model from OpenAI.10 GPT-3 is an enigma in the field of natural language models. It is democratizing because anyone can create a powerful language model in a few hundred characters that is deployable immediately. Yet its weights are a pseudo-trade secret, owned and licensed by OpenAI exclusively to Microsoft. Thus the only way to run it is via their website (or API). These kinds of models are known as Large Language Models. Any state-of-the-art language model should start with an LLM like GPT-3 or, for example, the freely available GPT-NEO.14 GPT-3 has been trained on billions of tokens and no effort has yet matched its scale of training data and model size. It can be unsettling too because it has quite adeptly captured the racism, sexism, and bias in human writing, which can be reflected in its responses.15 Mitigating this is an ongoing effort.16 Another interesting outcome is that “prompt engineering,” literally learning to interface more clearly with an AI, is now a research topic.17
GPT-3 has yet to make a major impact on chemistry, likely because it was available starting only in 2021. We previously prepared a demo of voice-controlled molecular dynamics analysis using GPT-3 to convert natural language into commands.18 Although an impressive example of voice-controlled computational chemistry had been published using Amazon's Alexa,19 we found in our work that GPT-3 could handle looser prompts such as “wait, actually change that to be ribbons.” It also took only about a dozen examples to teach GPT-3 how to do tasks like render a protein, change its representation, and select specific atoms using VMD's syntax.20 This is a significant reduction in researcher effort to make such tools, only taking a few hours total between the two of us. Our program itself adds an element of accessibility for those who may have difficulty with a keyboard and mouse interface through this voice-controlled interface, and we could easily, and plan to, generalize this approach to other analysis software used in our groups.
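The few-shot setup described above amounts to plain string assembly, which can be sketched as follows. The sentences and compound lists are invented placeholders, not the text of Fig. 2 or ref. 11, and the `Sentence:`/`Compounds:` labels and the helper name `build_few_shot_prompt` are assumptions made for this sketch. The resulting string would be sent to a completion model, which is expected to continue after the final label.

```python
# Hypothetical worked examples; these are placeholders, not the text of Fig. 2.
EXAMPLES = [
    ("The sample was dissolved in ethanol and titrated with sodium hydroxide.",
     ["ethanol", "sodium hydroxide"]),
    ("Benzene and toluene were separated by distillation.",
     ["benzene", "toluene"]),
    ("Particles were coated with silica and dispersed in water.",
     ["silica", "water"]),
]


def build_few_shot_prompt(examples, new_sentence):
    """Concatenate worked examples, then leave the final answer blank."""
    blocks = []
    for sentence, compounds in examples:
        blocks.append(f"Sentence: {sentence}\nCompounds: {', '.join(compounds)}")
    # The model is expected to continue the text after the last "Compounds:".
    blocks.append(f"Sentence: {new_sentence}\nCompounds:")
    return "\n\n".join(blocks)


prompt = build_few_shot_prompt(
    EXAMPLES, "Gold nanoparticles were stabilized with citrate."
)
print(prompt)
```

Because the three examples live entirely in the prompt, changing the task (for instance, extracting solvents only) means editing the examples rather than retraining anything.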
Perhaps because programmers were the most excited about GPT-3, frequent usage examples involved the generation of code. And thus we reach the present, with OpenAI's release in August of a GPT-3 model tuned explicitly for this purpose, termed Codex.1 Although automatic code generation in chemistry is not new (e.g. ref. 21–23), we believe that the scope and natural language aspects mean that code-generating LLMs like Codex will have a broad impact on both the computational and experimental chemistry community. Furthermore, Codex is just the first capable model and progress will continue. Already in late 2021 there are models that surpass GPT-3 in language24 and equal it but with 1/20th the number of parameters.25
Over time, there has been a tremendous increase in the number of available software packages to perform computational chemistry tasks. These off-the-shelf tools can enable students to perform tasks in minutes which might have taken a large portion of their PhD to complete just ten years ago. Yet now, a large fraction of a researcher's time that used to be spent on repetitive coding tasks has been replaced by learning the interfaces to these numerous software packages; this task is currently done by a combination of searching documentation pages on the web, reading and following tutorial articles, or simply by trial and error. These new NLP models are able to eliminate intermediate steps and allow researchers to get on with their most important task, which is research! Some successful examples we have tried are shown in Fig. 3, with full details in the ESI.† While reading these examples, remember that the model does not have a database or access to a list of chemical concepts. All chemistry knowledge, like the SMILES string for caffeine in example A, is entirely contained in the learned floating point weights. Moreover, keep in mind that Codex may produce code that is apparently correct and even executes, but which does not follow best scientific practice for a particular type of computational task.
Immediate impact on research and education
Scientific software
Many scientific programming tasks, whether for data generation or data analysis, are tedious and often repetitive over the course of a long research project. Codex can successfully complete a wide range of useful scientific programming tasks in seconds with natural language instructions, greatly reducing time to completion of many common tasks. These could include writing a function to convert between two different file formats, producing well-formatted plots with properly labeled axes, converting LaTeX equations into a function, implementing standard algorithms such as histogramming, adding comments to code, and converting code from one programming language to another.1 We have even found that Codex is capable of performing some of these tasks using non-English prompts, which could help reduce barriers to accessing software libraries faced by non-native speakers, although result accuracy when using non-English prompts has not been fully explored. Codex is not always successful. However, the rapid pace of progress in this field shows that we should begin to think seriously about these tasks being solved.
Will using code from Codex make chemists better or worse programmers? We think better. Codex removes the tedium of programming and lets chemists focus on the high-level science enabled by programs. Furthermore, the process of creating a prompt string, mentally checking whether it seems reasonable, testing that code on a sample input, and then iterating by breaking down the prompt string into simpler tasks will result in better algorithmic thinking by chemists. The code generated, if not guaranteed to be correct, at least satisfies common software coding conventions with clear variable names, and typically employs relevant software libraries to simplify complex tasks. We ourselves have learned about a number of existing chemistry software libraries that we would not have discovered otherwise through our iterative prompt creation. Note though that Codex does not need to have a priori knowledge of how to use your software of interest; API usage can be suggested as part of the prompt, similar to how the task is defined in Fig. 2.
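One way to suggest API usage in a prompt is to hand the model the beginning of a source file: imports, a few comments describing the relevant calls, and a function signature with a docstring. The sketch below shows this under stated assumptions. The library `mylab_io`, its `read_trajectory` function, and the helper `build_code_prompt` are all hypothetical names invented for illustration, not a real package or an interface used in this work.

```python
def build_code_prompt(imports, api_notes, signature, docstring):
    """Assemble the start of a Python file for a code model to continue."""
    lines = list(imports)
    lines.append("")
    # API hints are written as comments so the prompt stays valid Python.
    lines.extend(f"# {note}" for note in api_notes)
    lines.append("")
    lines.append(f"def {signature}:")
    lines.append(f'    """{docstring}"""')
    return "\n".join(lines) + "\n"


prompt = build_code_prompt(
    imports=["import mylab_io  # hypothetical in-house library"],
    api_notes=[
        "mylab_io.read_trajectory(path) returns a list of frames",
        "each frame has a .positions attribute in angstrom",
    ],
    signature="mean_position(path)",
    docstring="Return the average position of all atoms over a trajectory.",
)
print(prompt)
```

The model is then asked to complete the function body. Keeping the hints as comments means the prompt plus the completion can be saved and run as an ordinary Python file.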
Classroom settings
We and many of our colleagues around the world have begun introducing programming assignments as a component of our courses (especially in physical chemistry);26 this has dual pedagogical purposes of reinforcing the physical meaning underlying the equations we scribble on the board, and teaching our students a skill that is useful both for research and on the job market. One of us has even written a book on deep learning in chemistry and materials science based around this concept.27 But will code generation models result in poor academic honesty, especially when standard problems can be solved in a matter of seconds (Fig. 3)? Realistically we have few methods to police our students' behavior in terms of collaborating on programming assignments or copying from web resources. We rely, at least in part, on their integrity. We should rethink how these assignments are structured. Firstly, we currently limit the difficulty of programming assignments to align with the median programming experience of a student in our course. Perhaps now we can move towards more difficult and compound assignments. Secondly, we can move towards thinking of these assignments as a laboratory exercise, where important concepts can be explored using the software rather than concentrating on the process of programming itself. Lastly, our coursework and expectations should match the realities of what our students will face in their education and careers. They will always have access to web resources and, now, tools like Codex. We should embrace the fact that we no longer need to spend hours emphasizing the details of syntax, and instead focus on higher level programming concepts and on translating ideas from chemistry into algorithms.
Ongoing challenges
Access and price
Currently, access to advanced models from OpenAI and tools like GitHub Copilot is limited to users accepted into an early tester program. Pricing for the GPT-3 model by OpenAI indicates a per-query cost that is directly proportional to the length of the input prompt, typically on the order of 1–3 cents per query. This pricing model may of course change, but it is reasonable to expect that Codex will not be free until either there are competing open-source models or the hardware required for inference drops in price. Depending on this cost structure, these commercial NLP models may be inaccessible to the academic community, or to all but the best funded research groups and universities. For example, a group might need to run hundreds of thousands of queries to parse through academic literature and tens of thousands for students in a medium-sized course, and these would certainly be cost prohibitive. Models developed by the open-source community currently lag commercial ones in performance, but are freely useable, and will likely be the solution taken up in many areas of academia. However, even these models require access to significant computational resources to store and execute the models locally, and so we encourage the deployment of these models by researchers who have such computational resources in a way in which they can be equitably available.
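The scale of these costs follows from simple arithmetic, sketched below using the 1–3 cents per query quoted above. The query counts (200,000 and 20,000) are assumptions chosen only to stand in for “hundreds of thousands” and “tens of thousands,” prices are assumed to be in US dollars, and the helper name `cost_range_usd` is hypothetical.

```python
def cost_range_usd(n_queries, cents_low=1, cents_high=3):
    """Return (low, high) total cost in US dollars for n_queries."""
    return n_queries * cents_low / 100, n_queries * cents_high / 100


# Assumed query counts, chosen only to illustrate the orders of magnitude above.
scenarios = {"literature parsing": 200_000, "medium-sized course": 20_000}

for name, n in scenarios.items():
    low, high = cost_range_usd(n)
    print(f"{name}: {n:,} queries -> ${low:,.0f} to ${high:,.0f}")
```

Since the per-query cost scales with prompt length, few-shot prompts that carry several worked examples would sit toward the upper end of such a range or beyond it.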
Correctness
Code generation models do not guarantee correctness. Codex typically generates correct code at about a 30% rate on a single solution on standard problems, but improves to above 50% if multiple solutions are tried.1 In practice, we find that mistakes occur when a complex algorithm is requested with little clarity. Iterating by breaking a prompt into pieces, chaining together prompts into a dialogue, and giving additional clues like a function signature or imports usually yields a solution. The code generated rarely has syntax mistakes, but we find it fails in obvious ways (such as failing to import a library, or expecting a different data type to be returned by a function). Over-reliance on AI-generated code without careful verification could result in a loss of trust in scientific software and the analysis performed in published works. However, this is already an issue in scientific programming and strategies to assess correctness of code apply equally to human and AI-generated code. Interestingly, Codex can generate unit tests for code, although it is not clear that this strategy can identify its own mistakes.
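The gain from trying multiple solutions can be made concrete with the pass@k style of metric used in the Codex paper (ref. 1): if n solutions are generated for a problem and c of them pass its tests, the chance that at least one of k solutions drawn at random passes is 1 − C(n−c, k)/C(n, k). The sketch below assumes this form of the estimator. The sample counts in the usage lines are hypothetical, and the benchmark figures quoted above are averages over many problems rather than a single case like this one.

```python
from math import comb


def pass_at_k(n, c, k):
    """Chance that at least one of k samples drawn from n (c correct) passes."""
    if k > n:
        raise ValueError("k cannot exceed the number of generated samples n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical problem: 10 generated solutions, 3 of which pass the tests.
for k in (1, 2, 5):
    print(f"pass@{k} = {pass_at_k(10, 3, k):.3f}")
```

In this hypothetical case a 30% single-sample success rate rises above 50% with only two attempts, which is consistent with the trend described above, although the attempts still need tests or expert review to identify which solution is the correct one.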
Because the accuracy of Codex depends strongly on how the prompts are phrased, it remains unclear how accurate it can be for chemistry problems. We are currently developing a database of chemistry and chemical engineering examples that can be used to systematically evaluate LLM performance in these and related domains. A second question remains as to whether the code produced is scientifically correct (and best practice when multiple solutions exist) for a given task, which will still require expert human knowledge to verify for now. We also note that in practice some of the correctness is ensured by default settings of chemistry packages employed in the Codex solution, just as they might be with human generated code.
Fairness/bias
As discussed in the Codex paper,1 there are a number of possible issues related to fairness and bias which could accrue over time. The use of AI-generated code, and then the updated training of that AI on the new code, could lead to a focus on a narrow range of packages, methods, or programming languages. For example, Python is already pushing out other programming languages in computational chemistry and this could increase due to the performance of Codex in Python over languages like Fortran or Julia. Another example we noticed is the preference of Codex to generate code using certain popular software libraries, which could lead to consolidation of use. For example, a single point energy calculation shown in the ESI† selects the package Psi4 if the model is not prompted to use a particular package.
Outlook
There are many exciting ways in which AI techniques are being integrated into chemistry research [ref. 28–30]. Bench chemists have expressed the fear that automation will reduce the need for synthetic hands in the lab.31 Now it looks like these NLP models could reduce the need for computational chemists even sooner. We disagree in both cases. Better tools have not reduced the need for scientists over time, but rather expanded the complexity of problems that can be tackled by a single scientist or a team in a given amount of time. Despite the challenges in the previous section, we foresee the use of NLP models in chemistry increasing accessibility of software tools, and greatly increasing the scope of what a single research group can accomplish.
Data availability
All prompts and multiple responses are presented in the ESI.† Code was executed in Python 3.8. Access to OpenAI Codex and GPT-3 is governed by OpenAI and not the authors.
Conflicts of interest
There are no conflicts to declare.
Acknowledgments
Research reported in this work was supported by the National Institute of General Medical Sciences of the National Institutes of Health under award number R35GM137966 (to ADW) and R35GM138312 (to GMH). We thank John D. Chodera for a helpful discussion on Twitter of how some code examples could be seemingly correct while producing poor or incorrect answers if it is not checked that a proper version of an algorithm is employed.
Electronic supplementary information (ESI) available. See DOI: 10.1039/d1dd00009h
References
1. Chen M., Tworek J., Jun H., Yuan Q., Ponde H., Kaplan J., Edwards H., Burda Y., Joseph N. and Brockman G., et al., Evaluating large language models trained on code, arXiv:2107.03374, 2021
2. Sun Q., Berkelbach T. C., Blunt N. S., Booth G. H., Guo S., Li Z., Liu J., McClain J. D., Sayfutyarova E. R. and Sharma S., et al., PySCF: the Python-based simulations of chemistry framework, Wiley Interdiscip. Rev.: Comput. Mol. Sci., 2018, 8, e1340
3. Vaswani A., Shazeer N., Parmar N., Uszkoreit J., Jones L., Gomez A. N., Kaiser Ł. and Polosukhin I., Attention is all you need, in Advances in neural information processing systems, 2017, pp. 5998–6008
4. Devlin J., Chang M.-W., Lee K. and Toutanova K., BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv:1810.04805, 2018
5. More generally, “tokens” are masked
6. Taylor W. L., “Cloze procedure”: A new tool for measuring readability, Journalism Quarterly, 1953, 30, 415, doi: 10.1177/107769905303000401
7. Dai A. M. and Le Q. V., Semi-supervised sequence learning, Advances in Neural Information Processing Systems, 2015, 28, 3079
8. Radford A., Wu J., Child R., Luan D., Amodei D. and Sutskever I., et al., Language models are unsupervised multitask learners, OpenAI blog, 2019, 1, 9
9. Sutskever I., Martens J. and Hinton G. E., Generating text with recurrent neural networks, in ICML, 2011
10. Brown T. B., Mann B., Ryder N., Subbiah M., Kaplan J., Dhariwal P., Neelakantan A., Shyam P., Sastry G. and Askell A., et al., Language models are few-shot learners, arXiv preprint arXiv:2005.14165, 2020
11. Hueckel T., Hocky G. M., Palacci J. and Sacanna S., Ionic solids from common colloids, Nature, 2020, 580, 487, doi: 10.1038/s41586-020-2205-0
12. Krallinger M., Leitner F., Rabal O., Vazquez M., Oyarzabal J. and Valencia A., CHEMDNER: The drugs and chemical names extraction challenge, J. Cheminf., 2015, 7, 1, doi: 10.1186/1758-2946-7-S1-S1
13. Unpublished, but part of ongoing work known as davinci-instruct GPT-3 variant
14. Black S., Gao L., Wang P., Leahy C. and Biderman S., GPT-Neo: Large Scale Autoregressive Language Modeling with Mesh-Tensorflow, 2021
15. Bender E. M., Gebru T., McMillan-Major A. and Shmitchell S., On the dangers of stochastic parrots: Can language models be too big?, in Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 2021, pp. 610–623
16. Solaiman I. and Dennison C., Process for adapting language models to society (PALMS) with values-targeted datasets, arXiv preprint arXiv:2106.10328, 2021
17. Reynolds L. and McDonell K., Prompt programming for large language models: Beyond the few-shot paradigm, in Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems, 2021, pp. 1–7
18. https://github.com/whitead/marvis
19. Raucci U., Valentini A., Pieri E., Weir H., Seritan S. and Martínez T. J., Voice-controlled quantum chemistry, Nat. Comp. Sci., 2021, 1, 42, doi: 10.1038/s43588-020-00012-9
20. Humphrey W., Dalke A. and Schulten K., VMD: visual molecular dynamics, J. Mol. Graphics, 1996, 14, 33, doi: 10.1016/0263-7855(96)00018-5
21. MacLeod M. K. and Shiozaki T., Communication: Automatic code generation enables nuclear gradient computations for fully internally contracted multireference theory, J. Chem. Phys., 2015, 142, 051103, doi: 10.1063/1.4907717
22. Austin J., Odena A., Nye M., Bosma M., Michalewski H., Dohan D., Jiang E., Cai C., Terry M. and Le Q., et al., Program synthesis with large language models, arXiv preprint arXiv:2108.07732, 2021
23. Zirwes T., Zhang F., Denev J. A., Habisreuther P. and Bockhorn H., Automated code generation for maximizing performance of detailed chemistry calculations in OpenFOAM, in High Performance Computing in Science and Engineering’17, Springer, 2018, pp. 189–204
24. Rae J. W., Borgeaud S., Cai T., Millican K., Hoffmann J., Song F., Aslanides J., Henderson S., Ring R. and Young S., et al., Scaling language models: Methods, analysis & insights from training Gopher, arXiv preprint arXiv:2112.11446, 2021
25. Borgeaud S., Mensch A., Hoffmann J., Cai T., Rutherford E., Millican K., Driessche G. v. d., Lespiau J.-B., Damoc B. and Clark A., et al., Improving language models by retrieving from trillions of tokens, arXiv preprint arXiv:2112.04426, 2021
26. Ringer McDonald A., Teaching programming across the chemistry curriculum: A revolution or a revival?, in Teaching Programming across the Chemistry Curriculum, ACS Publications, 2021, pp. 1–11
27. White A. D., Deep Learning for Molecules and Materials, 2021
28. Keith J. A., Vassilev-Galindo V., Cheng B., Chmiela S., Gastegger M., Müller K.-R. and Tkatchenko A., Combining machine learning and computational chemistry for predictive insights into chemical systems, Chem. Rev., 2021, 121, 9816, doi: 10.1021/acs.chemrev.1c00107
29. Artrith N., Butler K. T., Coudert F.-X., Han S., Isayev O., Jain A. and Walsh A., Best practices in machine learning for chemistry, Nat. Chem., 2021, 13, 505, doi: 10.1038/s41557-021-00716-z
30. Pollice R., dos Passos Gomes G., Aldeghi M., Hickman R. J., Krenn M., Lavigne C., Lindner-D’Addario M., Nigam A., Ser C. T. and Yao Z., et al., Data-driven strategies for accelerated materials design, Acc. Chem. Res., 2021, 54, 849, doi: 10.1021/acs.accounts.0c00785
31. Chemjobber, Will robots kill chemistry?, Chem. Eng. News, 2019, 97(15), 25