Cold Spring Harbor Perspectives in Biology
2023 Dec;15(12):a041473. doi: 10.1101/cshperspect.a041473

Environmental Impacts of Machine Learning Applications in Protein Science

Loïc Lannelongue 1,2,3,4, Michael Inouye 1,2,3,4,5,6
PMCID: PMC10691472  PMID: 38040454

Abstract

Computing tools and machine learning models play an increasingly important role in biology and are now an essential part of discoveries in protein science. The growing energy needs of modern algorithms have raised concerns in the computational science community in light of the climate emergency. In this work, we summarize the different ways in which protein science can negatively impact the environment and we present the carbon footprint of some popular protein algorithms: molecular simulations, inference of protein–protein interactions, and protein structure prediction. We show that large deep learning models such as AlphaFold and ESMFold can have carbon footprints reaching over 100 tonnes of CO2e in some cases. The magnitude of these impacts highlights the importance of monitoring and mitigating them, and we list actions scientists can take to achieve more sustainable protein computational science.


Algorithms, computer simulations, and machine learning models are now an essential part of protein science. Statistical modeling was already used in the early 2000s to predict protein–protein interactions in yeast (Jansen et al. 2003), and the following 20 years saw algorithmic complexity increase in line with hardware and software abilities. AlphaFold (Jumper et al. 2021) is one of the recent examples; released in 2021 with millions of trainable parameters, it marked a leap forward for predicted protein structures. As a consequence of increased complexity, it is not uncommon for models to run for hours using hundreds, if not thousands, of processing cores.

While there is no doubt that such technological developments have enabled impressive discoveries, the contribution of these models to the climate emergency has raised concerns in recent years (Schwartz et al. 2020; Bender et al. 2021; Lannelongue et al. 2021a, 2023). High-performance computing has tangible impacts on the environment—mostly energy consumption but also water usage and ecological consequences—which raise ethical dilemmas for computational biologists: how to balance trying to cure diseases with the health impacts of climate change, partly fueled by large data centers?

Data centers’ global greenhouse gas (GHG) emissions are estimated to be ∼100 million tonnes of CO2e per year, similar to American commercial aviation (Lannelongue et al. 2021a). Estimates for the yearly carbon footprint of a scientist range between 4 and 37 tCO2e (Stevens et al. 2020; ALLEA 2022; Knödlseder et al. 2022), far greater than the upper bound of 2 tCO2e per person set by the Intergovernmental Panel on Climate Change (IPCC) to keep global warming under 1.5°C (Arias et al. 2021). Statistics from XSEDE (a now-concluded network of American research institutes) show the scale of computing in biology: in 2020, 586 million compute hours were dedicated to biochemistry or molecular structure and function (XSEDE Impact—usage statistics, portal.xsede.org/#/gallery).

The GHG emissions arising from computations need acknowledging and mitigating when possible. In this work, we highlight the main environmental impacts of computations used in protein science, and we explore the carbon footprint of some popular algorithms from molecular simulation, protein–protein interactions, and protein structure prediction.

THE ENVIRONMENTAL IMPACTS OF COMPUTATIONS

The standardized metric for carbon footprint is a quantity (usually in grams) of CO2-equivalent (gCO2e), which summarizes the environmental impact of GHG emissions with just one number. One of the challenges is that each gas has different impacts on climate change and different lifetimes; for example, it is estimated that over 100 years, 1 kg of methane will have the same impact on global warming as 28 kg of carbon dioxide. This is accounted for by giving methane a global warming potential (GWP100) of 28 (Myhre et al. 2013) (the GWP of carbon dioxide is 1 by definition). The final carbon footprint is calculated by weighting each GHG by its GWP100. For example, a mix of 4 kg of carbon dioxide and 2 kg of methane will have a carbon footprint of 4 × 1 + 2 × 28 = 60 kgCO2e. Generally, the GHGs considered are the ones included in the Kyoto basket, namely, carbon dioxide (CO2), methane (CH4), and nitrous oxide (N2O) (Hill et al. 2020), which constitute 97.9% of total GHG emissions (Our World in Data 2017). Notably, this definition of carbon footprint does not consider all environmental impacts, such as water usage, impact on wildlife, etc., and the values of the GWPs are debated as they may misestimate the impact of short-lived climate pollutants such as methane (Allen et al. 2018).
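The weighting described above is a simple sum, sketched here in a few lines of Python (GWP100 values for the Kyoto-basket gases from Myhre et al. 2013; the N2O value is included for completeness):

```python
# GWP100 values (Myhre et al. 2013); CO2 is 1 by definition.
GWP100 = {"CO2": 1, "CH4": 28, "N2O": 265}

def carbon_footprint_kg(emissions_kg):
    """Weight each greenhouse gas (in kg) by its GWP100, giving kgCO2e."""
    return sum(GWP100[gas] * mass for gas, mass in emissions_kg.items())

# The example from the text: 4 kg of CO2 and 2 kg of CH4
print(carbon_footprint_kg({"CO2": 4, "CH4": 2}))  # → 60 kgCO2e
```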

Each stage of the hardware's life cycle results in some environmental impacts. It starts with manufacturing: the mining of raw materials, assembly, and shipping alone can account for 70% to 90% of the total footprint of smartphones and laptops (Clément et al. 2020; Apple Environmental Reports 2023; Dell Technologies 2023). Manufacturing is a smaller share for data center hardware, between 15% and 40% of the total impacts (Dell Technologies 2023), but the environmental costs remain substantial. Disposing of technological waste (e-waste) is also responsible for considerable environmental impacts—water, air, and soil pollution—notwithstanding consequences for the health of waste workers. A recent report by the World Health Organization (2021) found that more than 82% of the 53.6 million tonnes of e-waste were not formally processed or recycled. Informal waste processing involves between 12 and 56 million people worldwide, including millions of children. In dump sites, often in low- and middle-income countries, informal waste workers are exposed to a range of toxic by-products and hazardous compounds, such as mercury, lead, cadmium, and other heavy metals.

The rest of the environmental impacts of computing comes from usage, mainly through electricity consumption, but also water use and the ecological impact of the facilities. The carbon footprint from energy usage (C in gCO2e) is the easiest aspect to quantify at the scale of one analysis or project. It depends on how much energy is needed (E in kWh) and how this energy is produced, called carbon intensity (CI in gCO2e/kWh) (Lannelongue et al. 2021a):

C = E × CI

where the energy needed (E) can be estimated from the runtime (t), the power draw of the processors (Pp) and the memory (Pm), and the efficiency of the data center (PUE):

E = t × (Pp + Pm) × PUE

PUE, which stands for power usage effectiveness, is the ratio between the total power delivered to the facility and the power used by the servers; it is a measure of overheads, mostly cooling. Lannelongue et al. (2021a) can be consulted for more details on these equations.
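These equations can be combined into a short, illustrative Python function. The job parameters below are hypothetical, and the world-average carbon intensity of 475 gCO2e/kWh is assumed:

```python
def energy_kwh(runtime_h, p_processors_w, p_memory_w, pue):
    """E = t × (Pp + Pm) × PUE, with power draws in watts and runtime in hours."""
    return runtime_h * (p_processors_w + p_memory_w) / 1000 * pue

def footprint_gco2e(e_kwh, carbon_intensity_g_per_kwh):
    """C = E × CI, with CI in gCO2e/kWh."""
    return e_kwh * carbon_intensity_g_per_kwh

# Hypothetical job: 10 h at 200 W (processors) + 50 W (memory), PUE of 1.5.
e = energy_kwh(10, 200, 50, 1.5)   # 3.75 kWh
print(footprint_gco2e(e, 475))     # → 1781.25 gCO2e
```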

As most data centers are powered by the general power grid, the carbon intensity depends on the energy production methods where the hardware is located. Because of differences between production methods, there is a wide variation between countries, with up to 3 orders of magnitude between Iceland (0.10 gCO2e/kWh) and Australia (770 gCO2e/kWh) (Fig. 1).

Figure 1. 2022 carbon intensity of electricity usage by country. Created using data from Carbon Footprint (2023). The world average value is available in the Global Energy & CO2 Status Report 2019 (IEA 2019).

Data storage tends to be considered separately from computations, as it is typically low power but constant over multiple years. The order of magnitude of the carbon footprint of storing one terabyte of data for one year is ∼10 kgCO2e (Nguyen et al. 2020; Seagate 2023). However, there can be great variations between hardware options. Just looking at Seagate's hard drives, cradle-to-grave carbon footprints range from 2 kgCO2e/TB/year to 66 kgCO2e/TB/year (Seagate 2023).
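As an order-of-magnitude sketch, the storage figures above translate into a one-line estimate (the ∼10 kgCO2e/TB/year value from the text is used as the default; actual rates vary widely with hardware):

```python
def storage_footprint_kg(terabytes, years, kg_per_tb_year=10):
    """Approximate footprint of data storage; the default rate is the
    ~10 kgCO2e/TB/year order of magnitude cited in the text."""
    return terabytes * years * kg_per_tb_year

print(storage_footprint_kg(5, 3))                     # → 150 kgCO2e
print(storage_footprint_kg(5, 3, kg_per_tb_year=66))  # → 990 kgCO2e (highest-rate drive)
```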

Moreover, different phases in computational projects may need to be assessed separately. Machine learning pipelines generally involve a research phase to identify and fine-tune the best model, followed by a training phase to build the final product. Then the model can be used to make predictions at scale (inference). Depending on how heavily used the final model is, most energy needs may come from training or prediction.

Estimating Carbon Footprints in Practice

Different tools exist to estimate the carbon footprint of computations: online calculators (e.g., Green Algorithms [Lannelongue et al. 2021a] or MLCO2 [Lacoste et al. 2019]), dedicated Python packages that track energy usage (e.g., Carbon Tracker [Anthony et al. 2020] and Code Carbon [codecarbon.io]), and server-side tools to monitor usage in data centers (e.g., GA4HPC [Green Algorithms 2023] or Amazon Web Services’ cloud dashboard [AWS 2023]). The pros and cons of each approach are summarized in Lannelongue and Inouye (2023). In summary, task-agnostic, general-purpose calculators such as online tools can be used across all fields of computational science but require the user to manually input information such as runtime and memory usage. Task-specific tools integrate tightly with the existing code to avoid this (e.g., Python packages for machine learning [Anthony et al. 2020; Henderson et al. 2020]), but different tools need to be built for each task. The tools mentioned so far are all user-based, either integrated into the analysis pipeline or used by inputting values into a calculator a posteriori. An alternative is to track usage from the server side, that is, with a calculator located on the computing platform that tracks usage continuously. Instead of being task-specific, this approach is platform-specific and particularly well suited to computations based in high-performance computing facilities. When feasible, it can address the limitations of the previously mentioned approaches. As part of the Green Algorithms Initiative, GreenAlgorithms4HPC (Green Algorithms 2023) is one example of such a tool, and some cloud providers integrate similar tools into their dashboards (AWS 2023). Recent studies investigated the accuracy and reliability of some of these tools and found that estimates were overall consistent and in line with real power consumption, particularly when training deep learning models (Bouza Heguerte et al. 2023; Jay et al. 2023).
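To illustrate how task-specific tools can integrate with existing code, here is a toy, hypothetical tracker (not one of the packages above): a decorator that times a function and converts its runtime into an energy and carbon estimate, assuming a constant power draw. Real trackers instead read hardware counters.

```python
import time
from functools import wraps

def track_footprint(power_w=250, pue=1.5, ci=475):
    """Toy decorator: time a function and convert runtime into energy (kWh)
    and carbon (gCO2e) estimates, assuming a constant hypothetical power draw."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            result = fn(*args, **kwargs)
            hours = (time.perf_counter() - start) / 3600
            kwh = hours * power_w / 1000 * pue  # E = t × P × PUE
            print(f"{fn.__name__}: {kwh:.2e} kWh, {kwh * ci:.2e} gCO2e")
            return result
        return wrapper
    return decorator

@track_footprint()
def toy_analysis():
    return sum(range(1_000_000))

toy_analysis()
```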

About Experimental Work

Although not the focus of this work, computational tools rely heavily on data from experimental methods, whether for training or validation. As with computations, part of the footprint of laboratories comes from energy use. Estimates of the energy usage of laboratories range from three to ten times the energy needs of an equivalently sized office, and the University of Oxford estimated that 60% of the university's GHG emissions are due to laboratory buildings (Royal Society of Chemistry 2022). For example, an ultralow temperature (ULT) freezer and a fume hood can each use as much energy as one to four typical households (My Green Lab 2023; Nature Portfolio 2023), with 16–22 kWh/day for a ULT freezer (University of Exeter 2023) and ∼75 kWh/day for a fume hood (Berkeley Lab 2023). For these, simple actions such as closing the sashes of a fume hood or raising a ULT freezer's set point from −80°C to −70°C can result in substantial energy savings (more than 40% and 30%–40%, respectively) (Royal Society of Chemistry 2022). Laboratory work also comes with heavy usage of water and chemicals, as well as single-use plastic (Labconscious 2020). It is estimated that research laboratories produce ∼5.5 million tonnes of plastic waste each year (Urbina et al. 2015), that an autoclave can use ∼270 L of water per cycle, and that an ultrapure water purification system discards 80% of the water input (My Green Lab 2023).

ALGORITHMS AND PROTEIN SCIENCE

In this section, we give a series of examples of the carbon footprint of popular algorithms and machine learning applications in protein science (Fig. 2). The list is not exhaustive, so we would encourage readers to investigate the carbon footprint of models in their field of expertise that we may not have discussed here.

Figure 2. Energy usage and carbon footprint of the main use cases presented here (log scale). Carbon footprints are in kilograms of CO2e and assume an average carbon intensity of 0.475 kgCO2e/kWh.

Molecular Simulations

Computer simulations are key tools for understanding how different components interact, with structure-based drug discovery among their most successful applications. Molecular docking methods predict how compounds (e.g., proteins) are likely to bind to each other and, to do so, require significant computing power due to the complexity of the task. Grealey et al. (2022) used a benchmark that studied a one-million-ligand campaign from the Directory of Useful Decoys (DUD) (Ruiz-Carmona et al. 2014). The DUD benchmark set contains 39 protein–ligand complexes with crystal structures, with 100 active ligands per complex on average, each with 36 decoys. Comparing three methods, rDock (Ruiz-Carmona et al. 2014), AutoDock Vina (Trott and Olson 2009), and Glide (Friesner et al. 2004), and using world average carbon intensities, Grealey et al. (2022) found carbon footprints of 13 kgCO2e for Glide, 154 kgCO2e for rDock, and 514 kgCO2e for AutoDock Vina (between 27 and 1082 kWh of energy); AutoDock Vina, for example, ran for over 40,000 core hours. While Glide seems to emit almost 40 times less GHG than AutoDock Vina, it is worth noting that, in contrast to AutoDock Vina and rDock, Glide is not freely available.

Grealey et al. (2022) also looked at the computing requirements of simulating the molecular dynamics of the Satellite Tobacco Mosaic Virus (one million atoms) for 100 nanoseconds and found the carbon footprint to be 18 kgCO2e with Amber (ambermd.org/index.php) (75 kWh) and 95 kgCO2e with NAMD (Phillips et al. 2020) (400 kWh); however, the two software packages use slightly different resolutions, so a direct comparison is not straightforward.

Protein–Protein Interactions

In silico methods to predict protein–protein interactions have grown more popular in recent years. While earlier tools tended to rely on low-power machine learning algorithms, most methods now leverage deep learning and involve longer training times. A recent comparison of machine learning (random forest) and deep learning (recurrent neural networks) found that, in some situations, the deep learning approach could emit 22,000 times more GHG for similar performance (Lannelongue and Inouye 2022). However, runtimes remain small: in this case, training the deep learning model once had a carbon footprint of 356 gCO2e, rising to ∼36 kgCO2e (75 kWh) when including fine-tuning of the network.

Protein Structure Prediction: AlphaFold and ESMFold

Compared to the interaction prediction models above, other work on proteins relies on larger and more complex algorithms. Protein structure prediction from primary sequences has recently been an area of intense activity for deep learning networks in protein science. AlphaFold, released in 2021 by DeepMind (Google) (Jumper et al. 2021), was a significant leap toward solving the protein-folding problem. ESMFold was released a year later by Meta (Lin et al. 2023), claiming comparable accuracy but faster inference. We will not assess the differences between the predicted structures but rather estimate the carbon footprints of training and predicting with these large neural networks, based on numbers included in the original publications.

Training and fine-tuning AlphaFold once used 128 TPUv3 cores for 11 d. According to the Green Algorithms calculator, this required 8.25 MWh of energy and would emit approximately 3.92 tCO2e (using the average world carbon intensity and Google's best PUE of 1.11). Prediction runtimes, and therefore carbon footprints, vary greatly with protein length: predicting the structure of a 384-residue protein takes 9.2 min and 24 gCO2e, while a 2500-residue protein costs 3 kgCO2e (using the ensemble model).

Three versions of ESMFold were released, with 700 million (700M), 3 billion (3B), and 15 billion (15B) parameters. Training each model once used 512 NVIDIA V100 GPUs; it took 8 d for the smallest model (700M), 30 d for the 3B one, and 60 d for the largest. These long runtimes come with proportionally large carbon footprints. Using the same PUE and carbon intensity as above, training the 15B model required 246 MWh and emitted 117 tCO2e; training the 700M and 3B models emitted, respectively, 16 tCO2e (33 MWh) and 58 tCO2e (123 MWh). ESMFold could predict a structure for a 384-residue protein in 14.2 sec on one GPU, which corresponds to just 0.6 gCO2e. Interestingly, while the training cost of ESMFold is higher than AlphaFold's, its inference cost is lower, hinting at trade-offs that need to be considered. In their publication, the authors present predictions for 620 million sequences from the MGnify90 database, which took 28,000 GPU days to complete (2 wk with a cluster of 2000 GPUs), requiring 224 MWh of energy and emitting 106 tCO2e.
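The training footprints above follow from simple arithmetic, sketched below (the reported energy values already include PUE, and the world-average carbon intensity of 475 gCO2e/kWh is assumed):

```python
WORLD_CI = 475  # gCO2e/kWh, world-average carbon intensity assumed in the text

def tco2e_from_mwh(energy_mwh, ci_g_per_kwh=WORLD_CI):
    """tCO2e = MWh × (gCO2e/kWh) / 1000; energy figures already include PUE."""
    return energy_mwh * ci_g_per_kwh / 1000

# Energy figures reported for each training or inference campaign
for name, mwh in [("AlphaFold training", 8.25),
                  ("ESMFold 700M", 33),
                  ("ESMFold 3B", 123),
                  ("ESMFold 15B", 246),
                  ("MGnify90 inference", 224)]:
    print(f"{name}: {tco2e_from_mwh(mwh):.2f} tCO2e")
```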

CONCLUSION

The algorithms presented in this work are examples of protein models with noticeable environmental impacts, up to 117 tCO2e to train one of the ESMFold models. While many analyses not discussed here will have negligible carbon footprints, the existence of such complex models highlights the importance of monitoring, acknowledging, and reducing such impacts when possible.

There are a few mitigating factors to take into account when discussing these high-carbon algorithms. For example, training these large protein-folding models is meant as a one-off cost; training a model once and making it available publicly, alongside predicted structures, prevents the unnecessary development of local models, and inference costs then tend to be significantly lower in most cases. For example, AlphaFold's team and EMBL's European Bioinformatics Institute have made 200 million structures available in a database that received over 700M API requests and 2.4M visitors in a year (Lannelongue et al. 2023). Pretrained foundation models like AlphaFold or ESMFold can also be downloaded and fine-tuned to tackle slightly different problems for a fraction of the environmental cost of training the full model (Motmaen et al. 2023). Moreover, it can be useful to compare computational models to the corresponding experiments (e.g., X-ray crystallography for protein structures [Bertoline et al. 2023]). Simulations are generally thought to have lower financial and environmental costs, which is likely true but would need to be assessed in detail moving forward. Such open-access models present undeniable benefits, both for scientific progress and the environment. However, the release of AlphaFold in 2021, followed by RoseTTAFold a few months later and ESMFold shortly after, shows that there is a risk of engaging in a race for ever-bigger models, similar to what has recently been seen with large language models (Bender et al. 2021). If not carefully considered, the environmental cost of developing each model may negate the expected environmental benefits. Recent work focusing on achieving similar performance with smaller models (e.g., DR-BERT for protein region annotation [Nambiar et al. 2023]) is a promising way forward.

Computational scientists can do a number of things to be more sustainable while still actively engaging in machine learning research (Lannelongue et al. 2021b). The first is to estimate and monitor the impact of the algorithms used. Such estimations should ideally be done before starting a project, to include the figures in cost-benefit analyses, and afterward, to acknowledge and track carbon impacts. Using the most efficient tool for a task can also have a great impact. Two examples come from the field of genome-wide association studies (GWASs): updating from v1 of Bolt-LMM to v2.3 can reduce energy usage, and GHG emissions, by 72% (Grealey et al. 2022), and switching from SAIGE to REGENIE (two other GWAS software tools) reduces carbon footprints by 85%, saving 2.4 tCO2e according to a study from the team behind REGENIE (Mbatchou et al. 2021). There is also scope to reduce carbon footprints by running computations when carbon intensity is low. For example, it has been shown that delaying machine learning tasks by up to 24 h can reduce GHG emissions by 10%–40% in the United States (and even 80% in some areas) (Dodge et al. 2022). Finally, these actions are part of a wider context of more sustainable computational science encapsulated by the GREENER principles (governance, responsibility, estimation, energy and embodied impacts, new collaborations, education, and research) (Lannelongue et al. 2023), and individual actions will need to be accompanied by broader institutional support to ensure that the societal benefits of protein models outweigh their environmental costs.
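The carbon-aware scheduling idea can be sketched with a toy example: given an hourly carbon-intensity forecast (the values below are hypothetical), choose the start hour that minimizes the job's total emissions.

```python
def best_start_hour(forecast, job_hours):
    """Return the start hour minimizing summed carbon intensity over the job."""
    windows = [(sum(forecast[h:h + job_hours]), h)
               for h in range(len(forecast) - job_hours + 1)]
    return min(windows)[1]

# Hypothetical 24 h forecast of grid carbon intensity (gCO2e/kWh),
# with a dip in the early-morning hours.
forecast = [300, 280, 250, 200, 180, 190, 240, 320,
            400, 450, 430, 410, 380, 360, 350, 370,
            420, 480, 500, 470, 430, 390, 340, 310]
print(best_start_hour(forecast, 3))  # → 3 (the 200 + 180 + 190 window)
```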

ACKNOWLEDGMENTS

L.L. was supported by the University of Cambridge MRC DTP (MR/S502443/1) and the BHF programme grant (RG/18/13/33946). M.I. was supported by the Munz Chair of Cardiovascular Prediction and Prevention and the NIHR Cambridge Biomedical Research Centre (BRC-1215-20014; NIHR203312). M.I. was also supported by the UK Economic and Social Research Council (ES/T013192/1). This work was supported by core funding from the British Heart Foundation (RG/18/13/33946) and the NIHR Cambridge Biomedical Research Centre (BRC-1215-20014; NIHR203312). The views expressed are those of the author(s) and not necessarily those of the NIHR or the Department of Health and Social Care. This work was also supported by Health Data Research UK, which is funded by the UK Medical Research Council, Engineering and Physical Sciences Research Council, Economic and Social Research Council, Department of Health and Social Care (England), Chief Scientist Office of the Scottish Government Health and Social Care Directorates, Health and Social Care Research and Development Division (Welsh Government), Public Health Agency (Northern Ireland), and British Heart Foundation and Wellcome.

Footnotes

Editors: Peter K. Koo, Christian Dallago, Ananthan Nambiar, and Kevin K. Yang

Additional Perspectives on Machine Learning for Protein Science and Engineering available at www.cshperspectives.org

REFERENCES

  1. ALLEA. 2022. Towards climate sustainability of the academic system in Europe and beyond. ALLEA, Berlin. doi:10.26356/climate-sust-acad
  2. Allen MR, Shine KP, Fuglestvedt JS, Millar RJ, Cain M, Frame DJ, Macey AH. 2018. A solution to the misrepresentations of CO2-equivalent emissions of short-lived climate pollutants under ambitious mitigation. NPJ Clim Atmos Sci 1: 16. doi:10.1038/s41612-018-0026-8
  3. Anthony LFW, Kanding B, Selvan R. 2020. Carbontracker: tracking and predicting the carbon footprint of training deep learning models. arXiv doi:10.48550/arXiv.2007.03051
  4. Apple Environmental Reports. 2023. A first for Apple. https://www.apple.com/environment
  5. Arias PA, Bellouin N, Coppola E, Jones RG, Krinner G, Marotzke J, Naik V, Palmer MD, Plattner GK, Rogelj J, et al. 2021. Technical summary. In Climate change 2021: the physical science basis. Contribution of working group I to the sixth assessment report of the intergovernmental panel on climate change, pp. 33–144. Cambridge University Press, Cambridge. doi:10.1017/9781009157896.002
  6. AWS. 2023. Customer carbon footprint tool. https://aws.amazon.com/aws-cost-management/aws-customer-carbon-footprint-tool
  7. Bender EM, Gebru T, McMillan-Major A, Shmitchell S. 2021. On the dangers of stochastic parrots: can language models be too big? In Proceedings of the 2021 ACM conference on fairness, accountability, and transparency (FAccT '21). Association for Computing Machinery, New York. doi:10.1145/3442188.3445922
  8. Berkeley Lab. 2023. Laboratory fume hood energy modeler. https://fumehoodcalculator.lbl.gov
  9. Bertoline LMF, Lima AN, Krieger JE, Teixeira SK. 2023. Before and after AlphaFold2: an overview of protein structure prediction. Front Bioinform 3: 1120370. doi:10.3389/fbinf.2023.1120370
  10. Bouza Heguerte L, Bugeau A, Lannelongue L. 2023. How to estimate carbon footprint when training deep learning models? A guide and review. Environ Res Commun doi:10.1088/2515-7620/acf81b
  11. Carbon Footprint. 2023. 2022 grid electricity emissions factors. https://www.carbonfootprint.com/docs/2023_02_emissions_factors_sources_for_2022_electricity_v10.pdf
  12. Clément LPPVP, Jacquemotte QES, Hilty LM. 2020. Sources of variation in life cycle assessments of smartphones and tablet computers. Environ Impact Assess Rev 84: 106416. doi:10.1016/j.eiar.2020.106416
  13. Dell Technologies. 2023. Reducing our impact, driving progress. https://www.dell.com/en-uk/dt/corporate/social-impact/advancing-sustainability/sustainable-products-and-services/product-carbon-footprints.htm
  14. Dodge J, Prewitt T, Tachet Des Combes R, Odmark E, Schwartz R, Strubell E, Luccioni AS, Smith NA, DeCario N, Buchanan W. 2022. Measuring the carbon intensity of AI in cloud instances. In 2022 ACM conference on fairness, accountability, and transparency, Seoul, Republic of Korea. ACM, New York. doi:10.1145/3531146.3533234
  15. Friesner RA, Banks JL, Murphy RB, Halgren TA, Klicic JJ, Mainz DT, Repasky MP, Knoll EH, Shelley M, Perry JK, et al. 2004. Glide: a new approach for rapid, accurate docking and scoring. 1: Method and assessment of docking accuracy. J Med Chem 47: 1739–1749. doi:10.1021/jm0306430
  16. Grealey J, Lannelongue L, Saw WY, Marten J, Méric G, Ruiz-Carmona S, Inouye M. 2022. The carbon footprint of bioinformatics. Mol Biol Evol 39: msac034. doi:10.1093/molbev/msac034
  17. Green Algorithms. 2023. The green algorithms project. https://github.com/GreenAlgorithms/GreenAlgorithms4HPC
  18. Henderson P, Hu J, Romoff J, Brunskill E, Jurafsky D, Pineau J. 2020. Towards the systematic reporting of the energy and carbon footprints of machine learning. J Mach Learn Res 21: 1–43.
  19. Hill N, Bramwell R, Karagianni E, Jones L, MacCarthy J, Hinton S, Walker C, Harris B. 2020. 2020 government greenhouse gas conversion factors for company reporting: methodology paper for conversion factors final report. Department for Business, Energy and Industrial Strategy, London.
  20. IEA. 2019. Global energy & CO2 status report 2019. https://www.iea.org/reports/global-energy-co2-status-report-2019/emissions
  21. Jansen R, Yu H, Greenbaum D, Kluger Y, Krogan NJ, Chung S, Emili A, Snyder M, Greenblatt JF, Gerstein M. 2003. A Bayesian networks approach for predicting protein–protein interactions from genomic data. Science 302: 449–453. doi:10.1126/science.1087361
  22. Jay M, Ostapenco V, Lefèvre L, Trystram D, Orgerie AC, Fichel B. 2023. An experimental comparison of software-based power meters: focus on CPU and GPU. In CCGrid 2023 - 23rd IEEE/ACM international symposium on cluster, cloud and internet computing, Bangalore, India.
  23. Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, Tunyasuvunakool K, Bates R, Žídek A, Potapenko A, et al. 2021. Highly accurate protein structure prediction with AlphaFold. Nature 596: 583–589. doi:10.1038/s41586-021-03819-2
  24. Knödlseder J, Brau-Nogué S, Coriat M, Garnier P, Hughes A, Martin P, Tibaldo L. 2022. Estimate of the carbon footprint of astronomical research infrastructures. Nat Astron 6: 503–513. doi:10.1038/s41550-022-01612-3
  25. Labconscious. 2020. Going green in a wet lab. https://www.labconscious.com/blog/going-green-in-a-wet-lab-symbolic-ivs-high-impact-actions
  26. Lacoste A, Luccioni A, Schmidt V, Dandres T. 2019. Quantifying the carbon emissions of machine learning. arXiv doi:10.48550/arXiv.1910.09700
  27. Lannelongue L, Inouye M. 2022. Pitfalls of machine learning models for protein–protein interactions. bioRxiv doi:10.1101/2022.02.07.479382
  28. Lannelongue L, Inouye M. 2023. Carbon footprint estimation for computational research. Nat Rev Methods Primers 3: 9. doi:10.1038/s43586-023-00202-5
  29. Lannelongue L, Grealey J, Inouye M. 2021a. Green algorithms: quantifying the carbon footprint of computation. Adv Sci 8: 2100707. doi:10.1002/advs.202100707
  30. Lannelongue L, Grealey J, Bateman A, Inouye M. 2021b. Ten simple rules to make your computing more environmentally sustainable. PLoS Comput Biol 17: e1009324. doi:10.1371/journal.pcbi.1009324
  31. Lannelongue L, Aronson HEG, Bateman A, Birney E, Caplan T, Juckes M, McEntyre J, Morris AD, Reilly G, Inouye M. 2023. GREENER principles for environmentally sustainable computational science. Nat Comput Sci 3: 514–521. doi:10.1038/s43588-023-00461-y
  32. Lin Z, Akin H, Rao R, Hie B, Zhu Z, Lu W, Smetanin N, Verkuil R, Kabeli O, Shmueli Y, et al. 2023. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379: 1123–1130. doi:10.1126/science.ade2574
  33. Mbatchou J, Barnard L, Backman J, Marcketta A, Kosmicki JA, Ziyatdinov A, Benner C, O'Dushlaine C, Barber M, Boutkov B, et al. 2021. Computationally efficient whole-genome regression for quantitative and binary traits. Nat Genet 53: 1097–1103. doi:10.1038/s41588-021-00870-7
  34. Motmaen A, Dauparas J, Baek M, Abedi MH, Baker D, Bradley P. 2023. Peptide-binding specificity prediction using fine-tuned protein structure prediction networks. Proc Natl Acad Sci 120: e2216697120. doi:10.1073/pnas.2216697120
  35. My Green Lab. 2023. Energy. https://www.mygreenlab.org/energy.html
  36. Myhre G, Shindell D, Bréon FM, Collins W, Fuglestvedt J, Huang J, Koch D, Lamarque JF, Lee D, Mendoza B, et al. 2013. Anthropogenic and natural radiative forcing. In Climate change 2013: the physical science basis. Contribution of working group I to the fifth assessment report of the intergovernmental panel on climate change. Cambridge University Press, Cambridge.
  37. Nambiar A, Forsyth JM, Liu S, Maslov S. 2023. DR-BERT: a protein language model to annotate disordered regions. bioRxiv doi:10.1101/2023.02.22.529574
  38. Nature Portfolio. 2023. The demand for ultracold storage has soared. https://www.nature.com/articles/d42473-021-00361-7
  39. Nguyen B, Sinistore J, Smith J, Arshi PS, Johnson LM, Kidman T, diCaprio TJ, Carmean D, Strauss K. 2020. Architecting datacenters for sustainability: greener data storage using synthetic DNA. In IEEE electronics goes green 2020. Fraunhofer Institute for Reliability and Microintegration IZM, Berlin.
  40. Our World in Data. 2017. CO2 and greenhouse gas emissions. https://ourworldindata.org/co2-and-other-greenhouse-gas-emissions
  41. Phillips JC, Hardy DJ, Maia JDC, Stone JE, Ribeiro JV, Bernardi RC, Buch R, Fiorin G, Hénin J, Jiang W, et al. 2020. Scalable molecular dynamics on CPU and GPU architectures with NAMD. J Chem Phys 153: 044130. doi:10.1063/5.0014475
  42. Royal Society of Chemistry. 2022. Sustainable laboratories: a community-wide movement toward sustainable laboratory practices. Royal Society of Chemistry, Cambridge.
  43. Ruiz-Carmona S, Alvarez-Garcia D, Foloppe N, Garmendia-Doval AB, Juhos S, Schmidtke P, Barril X, Hubbard RE, Morley SD. 2014. rDock: a fast, versatile and open source program for docking ligands to proteins and nucleic acids. PLoS Comput Biol 10: e1003571. doi:10.1371/journal.pcbi.1003571
  44. Schwartz R, Dodge J, Smith NA, Etzioni O. 2020. Green AI. Commun ACM 63: 54–63. doi:10.1145/3381831
  45. Seagate. 2023. Product sustainability. https://www.seagate.com/esg/planet/product-sustainability
  46. Stevens ARH, Bellstedt S, Elahi PJ, Murphy MT. 2020. The imperative to reduce carbon emissions in astronomy. Nat Astron 4: 843–851. doi:10.1038/s41550-020-1169-1
  47. Trott O, Olson AJ. 2009. AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. J Comput Chem 31: 455–461. doi:10.1002/jcc.21334
  48. Urbina MA, Watts AJR, Reardon EE. 2015. Labs should cut plastic waste too. Nature 528: 479. doi:10.1038/528479c
  49. University of Exeter. 2023. Sustainable labs. http://www.exeter.ac.uk/about/sustainability/sustainablelabs/energy/ultfreezers
  50. World Health Organization. 2021. Children and digital dumpsites: e-waste exposure and child health. World Health Organization, Geneva. https://apps.who.int/iris/handle/10665/341718
