Author manuscript; available in PMC: 2025 Nov 15.
Published in final edited form as: Nat Methods. 2024 Aug;21(8):1387–1389. doi: 10.1038/s41592-024-02324-4

Beyond protein lists: AI-assisted interpretation of proteomic investigations in the context of evolving scientific knowledge

Benjamin M Gyori 1,2,3,*, Olga Vitek 1,2,*
PMCID: PMC12617825  NIHMSID: NIHMS2110763  PMID: 39122950

Abstract

Mass spectrometry-based proteomics provides broad and quantitative detection of the proteome, but its results are mostly presented as protein lists. Artificial intelligence approaches will exploit prior knowledge from literature and harmonize fragmented datasets to enable mechanistic and functional interpretation of proteomics experiments.


Recent advances in experimentation, instrumentation and computation have dramatically increased the capabilities of mass spectrometry-based proteomics. Machine learning and artificial intelligence (AI) have played a critical role throughout this transformation. For example, it is now possible to predict technical characteristics of a proteomic experiment such as patterns of enzymatic protein digestion, fractionation, chromatographic separation, peptide observability or ion intensities in the mass spectra. These predictions, combined with other technological advances, have increased the coverage of the proteome to over 10,000 proteins in biofluids and tissues, with high quantitative accuracy. Similarly, it is now possible to interpret experimentally observed protein sequences, modifications or abundances in the proteomes in the context of predicted protein function or predicted 3D structure, deepening functional insights. In our view, an even broader variety of AI-driven advances will appear in the near future1. They will improve our ability to interpret spectra and harmonize and retrieve data from large and complex spectral repositories, and enable natural language-based interaction with these resources.

Despite these advances, most applications of mass spectrometry-based proteomics continue to produce results in the form of identified and quantified protein lists. In our collaborations, we have observed that it takes a painstaking expert-driven effort to design an experiment that is informative about biomolecular function and to translate the protein list into actionable insights and experimental follow-up. It is particularly challenging to interpret proteome-wide measurements in the context of rapidly accumulating biomedical knowledge. It is our opinion that the next frontier of AI-driven proteomics will alleviate this effort and will enable the machine-assisted design and actionable interpretation of proteomic experiments.

Systems approaches to biology have been pursued for at least two decades with the goal of understanding the collective behavior of proteins and other biomolecules from high-throughput data. On one hand, this has involved building mathematical models based on physicochemical principles (such as the kinetics of kinase–substrate binding and phosphorylation) to, for instance, identify molecular drivers of liver disease processes from mass spectrometry-based proteomics using a model of 23 proteins and 26 reactions involved in growth factor signaling2. On the other hand, data-driven approaches (such as correlation of co-expression, hierarchical clustering, statistical enrichment and mutual exclusivity) are widely used to produce association-based insights from large proteomics datasets such as Clinical Proteomic Tumor Analysis Consortium (CPTAC) data3. However, there remains a gap between these scales, limiting their success. Mathematical models typically focus on isolated pathways limited to dozens of proteins, whereas correlative analyses of proteome-scale data lose the connection to prior mechanistic knowledge, making results difficult to interpret.
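The mechanistic end of this spectrum can be made concrete with a minimal sketch: a single kinase–substrate binding and phosphorylation cycle written in mass-action kinetics and integrated with forward Euler. All species names, rate constants and concentrations below are illustrative placeholders, not parameters from the 23-protein liver-disease model of ref. 2.

```python
# Minimal mass-action model of one kinase-substrate phosphorylation cycle:
#   K + S <-> KS -> K + Sp   (binding, unbinding, catalysis)
# Parameters and initial concentrations are illustrative, not fitted to data.

def simulate(k_on=1.0, k_off=0.1, k_cat=0.5, dt=0.001, steps=20000):
    K, S, KS, Sp = 1.0, 10.0, 0.0, 0.0  # initial concentrations (arbitrary units)
    for _ in range(steps):
        v_bind = k_on * K * S      # kinase binds substrate
        v_unbind = k_off * KS      # complex dissociates unproductively
        v_cat = k_cat * KS         # catalysis releases phosphorylated substrate
        K += dt * (-v_bind + v_unbind + v_cat)
        S += dt * (-v_bind + v_unbind)
        KS += dt * (v_bind - v_unbind - v_cat)
        Sp += dt * v_cat
    return K, S, KS, Sp
```

Even this toy model shows why such approaches resist proteome-wide scaling: every reaction contributes several rate parameters that must be measured or fitted, which is tractable for dozens of proteins but not for thousands.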

In addition, we have observed that applications of systems biology approaches are particularly challenging in proteomics. This is due to the complex, dynamic nature of the proteome (including isoforms stemming from alternative splicing and the highly combinatorial space of post-translational modifications), incomplete measurements of the proteome, limited numbers of replicates and measurement uncertainty. Further, proteomic data are routinely available without sufficient semantic annotation, subject to inconsistent identifier standards and derived from distinct and difficult-to-reconcile preprocessing pipelines. They are thus incompatible with methods that rely on large quantities of training data in a standardized form4.

Overall, we believe that the current interpretation of proteomic experiments is hampered by a disconnect between the technical challenges of studying the proteome, detailed mechanistic approaches that rely on prior scientific knowledge but do not scale to whole proteomes, and correlative approaches that apply to large datasets but fail to capture functional mechanisms. This gap is an artifact of current scientific practice, rooted in the fundamentally limited ability of humans to integrate and dynamically update knowledge and data from multiple sources at scale.

We believe that future advances in machine learning and AI will make it possible to learn representations of complex systems from data and to compute over these systems with high speed. This, in turn, will allow us to scale up systems biology approaches to whole proteomes and close the gaps in the currently fragmented efforts. We anticipate that this progress will occur along the following three directions (Fig. 1).

Fig. 1: Future artificial intelligence approaches for interpreting proteomics experiments.

Future artificial intelligence approaches will scale systems biology to whole proteomes by harmonizing disconnected repositories of prior knowledge, creating an interoperability layer across heterogeneous datasets and integrating the information in a new generation of models that produce actionable hypotheses and experimental follow-up.

First, previously disconnected repositories of prior knowledge will be harmonized. A multitude of complementary but partially overlapping pathway databases describe protein function5, and harnessing the biomedical literature at this scale is only possible with machine reading6. Automated knowledge assembly approaches that keep up with the evolving scientific literature are already creating causal and mechanistic representations of cell biology that are more detailed and broader in scope than what was possible with human curation7,8.
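At its core, knowledge assembly merges redundant statements extracted from many sources while keeping track of their supporting evidence. The sketch below is a deliberately simplified illustration of that idea; the statement tuples, relation names and source labels are hypothetical, and real assembly systems7,8 additionally resolve entity identifiers, statement hierarchies and belief scores.

```python
from collections import defaultdict

# Toy causal statements as (subject, relation, object, source) tuples,
# as might be produced by machine readers and pathway databases.
# All names and sources here are illustrative only.
raw_statements = [
    ("EGFR", "phosphorylates", "ERK", "reader_A"),
    ("EGFR", "phosphorylates", "ERK", "reader_B"),   # same finding, second reader
    ("EGFR", "phosphorylates", "ERK", "pathway_db"),
    ("TP53", "activates", "CDKN1A", "pathway_db"),
]

def assemble(statements):
    """Deduplicate statements and aggregate their supporting sources."""
    assembled = defaultdict(set)
    for subj, rel, obj, source in statements:
        assembled[(subj, rel, obj)].add(source)
    # Rank unique statements by breadth of independent support.
    return sorted(assembled.items(), key=lambda kv: -len(kv[1]))
```

Ranking assembled statements by independent support is one simple way to privilege well-corroborated mechanisms over one-off extraction errors.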

Second, machine-assisted approaches will be deployed to create an interoperability layer across heterogeneous datasets, available across multiple repositories and organizations. Several major efforts by funding agencies, including the NCATS Biomedical Data Translator (https://ncats.nih.gov/research/research-activities/translator) and the ARPA-H Biomedical Data Fabric Toolbox (https://arpa-h.gov/research-and-funding/programs/arpa-h-bdf-toolbox) programs, speak to the urgency of the problem. These efforts will produce machine-assisted approaches that will streamline the annotations of key experimental factors (for example, cell lines, small molecules used for perturbations, protein identities and the overall context of a study, such as a disease). They will result in data interoperability within and across key community repositories such as the Cancer Research Data Commons (https://datacommons.cancer.gov/). Maintaining a catalog of resources providing identifiers for experimental factors is central to these efforts9, and semi-automated data annotation with such identifiers will play a key role in this integration in the future.
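A small part of this interoperability work is mechanical: mapping the loosely formatted identifiers found in practice onto canonical compact identifiers (CURIEs). The sketch below hard-codes a tiny prefix-synonym table for illustration; a real pipeline would draw prefixes, synonyms and identifier patterns from a registry such as the Bioregistry9 rather than maintaining them by hand.

```python
# Illustrative prefix-synonym table; real mappings would come from a registry.
PREFIX_SYNONYMS = {
    "hgnc": "hgnc",
    "up": "uniprot", "uniprotkb": "uniprot", "uniprot": "uniprot",
    "chebi": "chebi", "CHEBI": "chebi",
}

def normalize_curie(raw: str) -> str:
    """Map a loosely formatted identifier to a canonical prefix:local_id CURIE."""
    prefix, _, local_id = raw.partition(":")
    canonical = PREFIX_SYNONYMS.get(prefix) or PREFIX_SYNONYMS.get(prefix.lower())
    if canonical is None:
        raise ValueError(f"unknown prefix: {prefix!r}")
    # Some sources redundantly repeat the prefix in the local part,
    # e.g. "CHEBI:CHEBI:15377"; strip the duplicate.
    if local_id.lower().startswith(prefix.lower() + ":"):
        local_id = local_id.split(":", 1)[1]
    return f"{canonical}:{local_id}"
```

The same identifier can then be written identically across repositories, which is the precondition for joining datasets by protein, compound or cell line.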

Third, machine learning methods will effectively combine detailed representations of prior biomedical knowledge from the literature as it evolves over time, large quantities of prior proteomic and other experimental data from various sources, and context-specific data relevant to a particular study. Although knowledge graphs have already provided a flexible framework for such semantic integration10, they have a limited ability to express causal mechanisms of system behavior, and we predict that new alternatives will emerge. For example, some areas of biology, in particular those relying on transcriptomics, have effectively used transformer-based deep learning architectures as "foundation models" tunable to specific tasks such as transcriptome-level predictions of response to perturbations11. Together, progress in these three directions will create a new, principled way to relate novel experimental observations to pre-existing data and knowledge and to detect surprising new findings in large proteomic datasets, a task that is virtually intractable for humans working alone or with current computational methods.

Our ability to scale systems biology methods to whole proteomes will have far-reaching implications. In particular, it will close the loop between experimentation and computation. We therefore foresee increasing adoption of methods for automated hypothesis generation and planning of experimental follow-up12. Data from expensive small-scale experiments will be augmented with generative models and transfer learning methods to simulate future experiments and optimize their design. In clinical settings, digital twin models13 that make personalized predictions evolving over time will similarly close the loop between data and models.

Despite these promises, the new generation of computational approaches also carries substantial risks. Our ability to represent biological systems, and to compute on these representations at a larger scale, will not always translate into meaningful insights. When not carefully designed, machine learning and AI capture both true biological behavior and underlying artifacts, effectively regurgitating existing patterns as a "stochastic parrot"14. In practice, this means that model-based interpretations can become biased toward previously over-studied proteins and may even disincentivize critical assessment of the input data, making it difficult to illuminate the "dark proteome"15. In clinical settings, large-scale integration of proteomic datasets with data from other sources can learn person-specific patterns and raise human-subject privacy concerns. These issues need their own methodological solutions before the field becomes dominated by powerful methods that we do not fully understand.

We believe that a solution to sustaining these developments while avoiding the risks lies in transparent algorithms and software, as well as in interdisciplinary training. Open-source software and open method evaluation will be supported by a specialized computing ecosystem of interoperable open-source tools and experimental and synthetic benchmarks, capable of simultaneously handling large models and large datasets. In training, traditional degree programs may be complemented by focused short courses worldwide that train participants in interdisciplinary aspects of proteomic research. The courses will build on a long tradition of interdisciplinary training in the proteomics community, offered for example by EMBL, the University of Washington, the University of Wisconsin and the May Institute at Northeastern University, and at scientific meetings. With these efforts, accurate and actionable interpretation of proteomics data in the context of evolving scientific knowledge will be well within our reach.

References

1. Neely, B. A. et al. Toward an integrated machine learning model of a proteomics experiment. J. Proteome Res. 22, 681–696 (2023).
2. Burbano De Lara, S. et al. Basal MET phosphorylation is an indicator of hepatocyte dysregulation in liver disease. Mol. Syst. Biol. 20, 187–216 (2024).
3. Wu, P. et al. Integration and analysis of CPTAC proteomics data in the context of cancer genomics in the cBioPortal. Mol. Cell. Proteomics 18, 1893–1898 (2019).
4. Listgarten, J. The perpetual motion machine of AI-generated data and the distraction of ChatGPT as a 'scientist'. Nat. Biotechnol. 1–3 (2024).
5. Türei, D., Korcsmáros, T. & Saez-Rodriguez, J. OmniPath: guidelines and gateway for literature-curated signaling pathway resources. Nat. Methods 13, 966–967 (2016).
6. Valenzuela-Escárcega, M. A. et al. Large-scale automated machine reading discovers new cancer-driving mechanisms. Database (Oxford) 2018 (2018).
7. Gyori, B. M., Bachman, J. A. & Kolusheva, D. A self-updating causal model of COVID-19 mechanisms built from the scientific literature. In BioCreative VII Challenge Evaluation Workshop 249–253 (2021).
8. Bachman, J. A., Gyori, B. M. & Sorger, P. K. Automated assembly of molecular mechanisms at scale from text mining and curated databases. Mol. Syst. Biol. 19, e11325 (2023).
9. Hoyt, C. T. et al. Unifying the identification of biomedical entities with the Bioregistry. Sci. Data 9, 714 (2022).
10. Callahan, T. J. et al. An open source knowledge graph ecosystem for the life sciences. Sci. Data 11, 363 (2024).
11. Cui, H. et al. scGPT: toward building a foundation model for single-cell multi-omics using generative AI. Nat. Methods 1–11 (2024).
12. Kitano, H. Nobel Turing Challenge: creating the engine for scientific discovery. NPJ Syst. Biol. Appl. 7, 29 (2021).
13. Tang, C. et al. A roadmap for the development of human body digital twins. Nat. Rev. Electr. Eng. 1–9 (2024).
14. Messeri, L. & Crockett, M. J. Artificial intelligence and illusions of understanding in scientific research. Nature 627, 49–58 (2024).
15. Kustatscher, G. et al. Understudied proteins: opportunities and challenges for functional proteomics. Nat. Methods 19, 774–779 (2022).
