Skip to main content
Nature Communications logoLink to Nature Communications
. 2025 Sep 25;16:8421. doi: 10.1038/s41467-025-64089-4

Facilitating analysis and dissemination of proteomics data through metadata integration in MaxQuant

Walter Viegener 1,#, Shamil Urazbakhtin 1,#, Daniela Ferretti 1, Jürgen Cox 1,, Jinqiu Xiao 1,
PMCID: PMC12462489  PMID: 40998832

Abstract

Metadata plays an essential role in the analysis and dissemination of proteomics data. It annotates sample information for output tables from library searches and displays sample information from data files in public repositories. However, integrating metadata into data analysis can be time-consuming and is not well standardized. Inconsistent metadata formats in public repositories hinder other researchers’ ability to reproduce and reuse these public datasets. Here we present the metadata integration in MaxQuant, which provides a user-friendly way to export metadata as SDRF, the standard format that maps sample properties to proteomics data files. We also implemented the annotation of output tables with the SDRF file, enabling users to perform seamless downstream data analysis with annotated output tables. These features provide a simple and standardized approach to creating and leveraging standardized metadata, thereby facilitating data analysis and improving the reusability and reproducibility of public proteomics datasets.

Subject terms: Data integration, Computational platforms and environments, Data publication and archiving, Software, Standards


Metadata annotation is crucial for data analysis, however it is often overlooked when depositing proteomics data. Here, the authors introduce Sample and Data Relationship Format (SDRF)-compliant metadata integration feature in MaxQuant, a widely used software for proteomics data analysis.

Introduction

Mass spectrometry-based proteomics is the study of expression, interaction and modification of all proteins in living organisms using mass spectrometry1. Advances in instrumentation and data acquisition methods have made it a more powerful tool for studying active biological processes by greatly improving the resolution, sensitivity, reproducibility and throughput of the measurement25. After the sample measurement, library search is performed to identify and quantify the peptides and proteins6. Then output tables that store the library search results need to be annotated with the metadata before the downstream data analysis, due to the fact that the column names of output tables are created based on the data files and may not directly reflect the sample-related properties.

This annotation of output tables could be very cumbersome for studies with large cohorts and complex experimental design. It’s also more complicated to annotate studies with multiplexed and fractionated experiments, since there is no direct relation between samples and data files. For multiplexed experiments, multiple samples are included in the same data file7. For fractionated samples, multiple data files are related to the same sample8. If the metadata are in inconsistent formats across datasets within a single study, such as when reanalyzing multiple public datasets, additional effort would be required for the manual annotation of the output tables9.

In order to address inconsistent and incomplete metadata in proteomics studies, the Sample and Data Relationship Format (SDRF) was introduced as a standardized format to store the proteomics sample metadata10. The SDRF is a tab-delimited text format that describes the relationship between samples and data files in proteomics studies. It consists of sample-related metadata, data file-related metadata and the variables under study. Submitting the SDRF file together with raw files to public repositories like ProteomeXchange11 is recommended to enhance the reusability and reproducibility of public datasets. Several tools have been developed for exporting12,13 or reading SDRF14,15 to promote this standard metadata format.

However, very few proteomics datasets currently include the SDRF file in data submissions12. First, creating the SDRF file remains complicated and is not mandatory for data submission. Currently, users need to fill the SDRF file row by row, and each row corresponds to one sample and data file relationship, therefore for some experimental setups many rows require the entry of repetitive information. For example, multiplexed experiments have repetitive data file properties, and fractionated experiments have repetitive sample properties. Second, users benefit little from creating the SDRF file for their own projects because it doesn’t significantly impact the data analysis. Creating the SDRF file requires documenting lots of required data file and sample properties, but most of the data file properties will not be taken into consideration in data analysis. Furthermore, sample properties in the SDRF file currently cannot be used to directly annotate output tables.

In this work, we address the aforementioned challenges by implementing the metadata integration feature in MaxQuant, a widely used software that supports proteomics data analysis from various instruments, quantification techniques and data acquisition modes1618. This feature simplifies the process of creating an SDRF file. We also implement automatic metadata annotation of output tables using Perseus or the provided scripts to further assist users with their own data analysis.

Results and discussion

Overview of metadata integration in MaxQuant

In addition to the raw files and FASTA files, metadata integration is implemented in the MaxQuant workflow (Fig. 1). The SDRF file, which is created in the process, consists of all required sample properties, which are mainly filled out by users, as well as all required data file properties, which are extracted automatically by MaxQuant from the input files and user-defined parameters.

Fig. 1. Overview of metadata integration in MaxQuant and its application.

Fig. 1

MaxQuant converts metadata to the SDRF file, which can then be used for data analysis and data dissemination. SDRF Sample and Data Relationship Format.

The SDRF file could play an essential role in both data dissemination and analysis. First, it could be uploaded to public repositories alongside raw files to improve the reproducibility and reusability of the datasets. Moreover, it can be used to automatically annotate the output tables from MaxQuant for Perseus, as well as other tools using the provided scripts written in Python and R.

Export metadata as an SDRF file in MaxQuant

Since MaxQuant v2.7.0, the feature to export metadata as an SDRF file has been implemented in the “Metadata” tab (Fig. 2). In MaxQuant, all required data file properties specified by SDRF would be extracted and converted to ontology terms and accession numbers when the SDRF file is written (Table 1). After setting up the parameters for the library search, click the “Refresh” button under the “Metadata” tab, and the corresponding metadata table will appear. To reduce the confusion caused by having too many columns, the metadata table only includes basic data file information and the required sample properties that users need to fill out. An additional column named “Group” is also included, representing the study variable “factor value[group]” in SDRF. This column doesn’t have to follow ontology-based rules, users can define any project-specific treatments or phenotypes that do not belong to one of the required sample properties.

Fig. 2. Metadata table in MaxQuant GUI.

Fig. 2

The metadata table is automatically generated according to the experimental settings under the ‘Metadata’ tab. The following screenshot shows the metadata table for a 24-fraction TMT10 dataset.

Table 1.

Overview of the columns from the SDRF that MaxQuant fills

Column name Source
Source name From “Raw data” -> user-specified “Experiment”
Characteristics[organism] From the fasta file if UniProt database is used, otherwise needs to be filled by users
Technology type Filled with “proteomic profiling by mass spectrometry”
Assay name From file name without file extension
Comment[data file] From file name including file extension
Comment[fraction identifier] From “Raw data” -> user-specified “Fraction”
Comment[label] From “Group-specific parameters” -> “Type” -> user-specified “Type”
Comment[cleavage agent details] From “Group-specific parameters” -> “Digestion” -> user-specified “Enzyme”
Comment[instrument] From instrument information recorded in the raw files
Comment[modification parameters] From “Group-specific parameters” -> “Modifications” -> user-specified “Fixed modifications” and “Variable modifications”
Comment[proteomics data acquisition method] From “Group-specific parameters” -> “Type” -> user-specified “Type”
Comment[tool metadata] MaxQuant with version information

The layout of the metadata table in MaxQuant differs from the SDRF, wherein each row represents one sample and data file relationship. In the metadata table, however, each row represents one sample. Since only sample properties need to be filled in to create an SDRF file in MaxQuant, the layout of metadata table is particularly user-friendly. The exact layout is also automatically generated according to the experimental settings, thereby eliminating the need to fill in repetitive information. For multiplexed dataset, there will be expanded rows corresponding to different labels for each raw file, all labels will be assigned identical data file properties. For fractionated dataset different raw files corresponding to the same sample will be collapsed into one row, these raw files will be assigned identical sample properties. For example, in an experiment with one TMT10 set fractionated into 20 fractions, the SDRF file contains 200 rows. Users only need to fill a 10-row metadata table in MaxQuant to generate the SDRF file. When MaxQuant writes the SDRF file, information from the metadata table will be converted to the SDRF layout and combined with corresponding data file properties.

The metadata table can be filled out in MaxQuant GUI or in the template exported by MaxQuant. There’s no maximum limit to the number of rows that can be generated in the metadata table. Users can either edit multiple rows at once in MaxQuant GUI, or fill out the exported template outside of MaxQuant. In the SDRF, some sample properties are required but might not be available for users’ projects (e.g. the ancestry category for human samples). Users can leave these columns empty, and MaxQuant will have the option to automatically fill them with “not available”. Then the exported SDRF file will have no missing values and can be submitted directly to public repositories without further editing. The SDRF file will be created later as one of the output tables after the library search, or immediately by clicking the “Write SDRF” button (Supplementary Data 1, example of SDRF file generated by MaxQuant).

In summary, the auto-adjusted layout and autofill feature of the metadata table in MaxQuant simplify the process of creating the SDRF file. Automatic extraction of data file properties saves users the effort of collecting and organizing information that rarely impacts their data analysis. It is especially helpful for proteomics facilities to provide a well-annotated SDRF file to clients without the analytical expertise to annotate the data file properties. The possibility to edit multiple rows simultaneously and read the template file improves efficiency especially when handling large datasets. By default, the SDRF file becomes one of the regular tables written in every MaxQuant run. This feature will encourage the deposit of standardized SDRF metadata in public proteomics repositories, helping to improve the reproducibility and reusability of public datasets.

Annotate the MaxQuant output tables with the SDRF file for downstream data analysis

In MaxQuant, “Experiment” is the unique identifier for each sample, except for multiplexed datasets, in which each sample is represented as a combination of “Experiment” and label. In the SDRF, the source name has the same definition. Therefore, in the MaxQuant-exported SDRF file, the source name is identical to “Experiment” (combined with the label if the dataset is multiplexed). Thus, it can be used to annotate MaxQuant output tables.

Perseus is a widely used software that performs downstream omics data analysis such as pre-processing, statistical analysis and data visualization19. Since it was co-developed with MaxQuant, it is especially compatible with MaxQuant output tables. The feature that annotates MaxQuant output tables with the SDRF file has been implemented since Perseus v2.1.5. Clicking “Read SDRF” under “Annot. rows” automatically adds all information from the SDRF file to the corresponding intensity columns as annotation rows. By default, “Skip repetitive properties” is selected to only add properties that are not identical between all samples as annotation rows (Fig. 3). With added annotation rows from the SDRF file, users can continue with data filtering, normalization and statistical analysis directly with the annotated output tables. Two scrips (https://github.com/cox-labs/converters), written in Python and R, are provided to convert MaxQuant output tables and the SDRF file into an expression matrix and a sample annotation table. These two files are required by other popular tools for downstream data analysis, such as DESeq220, limma21, and edgeR22. As long as the source name is identical to “Experiment” in MaxQuant, SDRF files created by other tools can also be used to annotate output tables with Perseus or can be converted by the provided scripts.

Fig. 3. Annotation of the protein groups table with the SDRF file in Perseus.

Fig. 3

The protein groups table from a 24-fraction TMT10 dataset is annotated with the SDRF file. “Skip redundant properties” option is selected to only keep the properties that are not identical across all samples.

The SDRF was initially introduced to store metadata during data dissemination, which is one of the last steps in finalizing a proteomics study. Here, we extended the application of the SDRF file to output tables annotation, which is a key step in data analysis. As a result, this output table annotation feature will motivate users to create SDRF files because it helps with data analysis. As the SDRF was adapted from the MicroArray Gene Expression Tabular (MAGE-TAB)23, which is used to encode metadata annotations for RNA-Seq data24, SDRF files could also perform as a bridge for multi-omics data integration in the future.

Methods

Software development, requirements and usage

Metadata integration is a new feature implemented in MaxQuant v2.7.0 and Perseus v2.1.5. MaxQuant is developed using.NET8 and written in C# programming language. The command-line version can be run on Windows, Linux and macOS (provided that the vendor libraries are available for accessing the raw files). The graphical user interface (GUI) is only available on Windows at the moment. It can be downloaded from https://www.maxquant.org/maxquant. Perseus is developed using.NET8 and written in C# programming language. Its GUI is on Windows only at the moment. It can be downloaded from https://www.maxquant.org/perseus/.

The video tutorials about MaxQuant and Perseus are available at https://www.youtube.com/@MaxQuantChannel. The detailed tutorial on exporting the SDRF file and using it to annotate output tables is provided as PDF in the Supplementary Information. Alternatively, the video tutorial can be viewed on YouTube at https://youtu.be/fHCPOBXXRp8.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Supplementary information

Supplementary Information (759.3KB, pdf)
41467_2025_64089_MOESM2_ESM.pdf (23.1KB, pdf)

Description of Additional Supplementary Data

Supplementary Data 1 (10.3KB, txt)
Reporting Summary (847.8KB, pdf)

Acknowledgements

The authors would like to thank Barbara Steigenberger for her explanation of the metadata management. This work was supported by the German Ministry for Science and Education funding action CLINSPECT-M [FKZ 161L0214E, to J.X.].

Author contributions

W.V. developed the new features in MaxQuant and Perseus. S.U. performed the tests on the new features. W.V., S.U., D.F., J.C. and J.X. conceptualized the project and designed the new features. J.C. and J.X. supervised the project and drafted the manuscript. All authors have read and approved the final manuscript.

Peer review

Peer review information

Nature Communications thanks Yasset Perez-Riverol and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. A peer review file is available.

Funding

Open Access funding enabled and organized by Projekt DEAL.

Data availability

No new mass spectrometry proteomics data was generated in the scope of this work. The 24-fraction TMT10 dataset displayed in Figs. 2 and 3 is a subset of the dataset from the CPTAC data portal under accession number S06025 and was obtained from Proteomic Data Commons (PDC) repository using bash script from https://github.com/esacinc/PDC-Public/tree/master/tools/downloadPDCData.

Code availability

Both MaxQuant and Perseus are freeware that can be used for all purposes, including commercial use. The partially open-source code was deposited at https://github.com/JurgenCox/mqtools7, under the CC BY-NC-ND 4.0 license. The scripts for annotating the output tables were created using Python v3.12.3 and R v4.5.0, and were deposited at https://github.com/cox-labs/converters, under the CC BY-NC-ND 4.0 license.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

These authors contributed equally: Walter Viegener, Shamil Urazbakhtin.

Contributor Information

Jürgen Cox, Email: cox@biochem.mpg.de.

Jinqiu Xiao, Email: jxiao@biochem.mpg.de.

Supplementary information

The online version contains supplementary material available at 10.1038/s41467-025-64089-4.

References

  • 1.Guo, T., Steen, J. A. & Mann, M. Mass-spectrometry-based proteomics: from single cells to clinical applications. Nature638, 901–911 (2025). [DOI] [PubMed] [Google Scholar]
  • 2.Eliuk, S. & Makarov, A. Evolution of orbitrap mass spectrometry instrumentation. Annu Rev. Anal. Chem.8, 61–80 (2015). [DOI] [PubMed] [Google Scholar]
  • 3.Meier, F., Park, M. A. & Mann, M. Trapped ion mobility spectrometry and parallel accumulation-serial fragmentation in proteomics. Mol. Cell Proteom.20, 100138 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Gillet, L. C. et al. Targeted data extraction of the MS/MS spectra generated by data-independent acquisition: a new concept for consistent and accurate proteome analysis. Mol. Cell Proteom.11, O111 016717 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Guzman, U. H. et al. Ultra-fast label-free quantification and comprehensive proteome coverage with narrow-window data-independent acquisition. Nat. Biotechnol.42, 1855–1866 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Domon, B. & Aebersold, R. Challenges and opportunities in proteomics data analysis. Mol. Cell Proteom.5, 1921–1926 (2006). [DOI] [PubMed] [Google Scholar]
  • 7.Pappireddi, N., Martin, L. & Wuhr, M. A review on quantitative multiplexed proteomics. Chembiochem20, 1210–1224 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Manadas, B., Mendes, V. M., English, J. & Dunn, M. J. Peptide fractionation in proteomics approaches. Expert Rev. Proteom.7, 655–663 (2010). [DOI] [PubMed] [Google Scholar]
  • 9.Griss, J., Perez-Riverol, Y., Hermjakob, H. & Vizcaíno, J. A. Identifying novel biomarkers through data mining-A realistic scenario? Proteom. Clin. Appl9, 437–443 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Dai, C. X. et al. A proteomics sample metadata representation for multiomics integration and big data analysis. Nat. Commun.12, 5854 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Vizcaino, J. A. et al. ProteomeXchange provides globally coordinated proteomics data submission and dissemination. Nat. Biotechnol.32, 223–226 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Claeys, T. et al. lesSDRF is more: maximizing the value of proteomics data through streamlined metadata annotation. Nat. Commun.14, 6743 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Kong, A. T., Leprevost, F. V., Avtonomov, D. M., Mellacheruvu, D. & Nesvizhskii, A. I. MSFragger: ultrafast and comprehensive peptide identification in mass spectrometry-based proteomics. Nat. Methods14, 513–520 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Dai, C. et al. quantms: a cloud-based pipeline for quantitative proteomics enables the reanalysis of public proteomics data. Nat. Methods21, 1603–1607 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Zheng, P. et al. Ibaqpy: a scalable Python package for baseline quantification in proteomics leveraging SDRF metadata. J. Proteom.317, 105440 (2025). [DOI] [PubMed] [Google Scholar]
  • 16.Ferretti, D. et al. Isobaric labeling update in MaxQuant. J. Proteome Res.24, 1219–1229 (2025). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Sinitcyn, P. et al. MaxDIA enables library-based and library-free data-independent acquisition proteomics. Nat. Biotechnol.39, 1563–1573 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Cox, J. & Mann, M. MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification. Nat. Biotechnol.26, 1367–1372 (2008). [DOI] [PubMed] [Google Scholar]
  • 19.Tyanova, S. et al. The Perseus computational platform for comprehensive analysis of (prote)omics data. Nat. Methods13, 731–740 (2016). [DOI] [PubMed] [Google Scholar]
  • 20.Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol.15, 550 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Ritchie, M. E. et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res.43, e47 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.Robinson, M. D., McCarthy, D. J. & Smyth, G. K. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics26, 139–140 (2010). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Rayner, T. F. et al. A simple spreadsheet-based, MIAME-supportive format for microarray data: MAGE-TAB. BMC Bioinforma.7, 489 (2006). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Fullgrabe, A. et al. Guidelines for reporting single-cell RNA-seq experiments. Nat. Biotechnol.38, 1384–1386 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Krug, K. et al. Proteogenomic landscape of breast cancer tumorigenesis and targeted therapy. Cell183, 1436–1456 e1431 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplementary Information (759.3KB, pdf)
41467_2025_64089_MOESM2_ESM.pdf (23.1KB, pdf)

Description of Additional Supplementary Data

Supplementary Data 1 (10.3KB, txt)
Reporting Summary (847.8KB, pdf)

Data Availability Statement

No new mass spectrometry proteomics data was generated in the scope of this work. The 24-fraction TMT10 dataset displayed in Figs. 2 and 3 is a subset of the dataset from the CPTAC data portal under accession number S06025 and was obtained from Proteomic Data Commons (PDC) repository using bash script from https://github.com/esacinc/PDC-Public/tree/master/tools/downloadPDCData.

Both MaxQuant and Perseus are freeware that can be used for all purposes, including commercial use. The partially open-source code was deposited at https://github.com/JurgenCox/mqtools7, under the CC BY-NC-ND 4.0 license. The scripts for annotating the output tables were created using Python v3.12.3 and R v4.5.0, and were deposited at https://github.com/cox-labs/converters, under the CC BY-NC-ND 4.0 license.


Articles from Nature Communications are provided here courtesy of Nature Publishing Group

RESOURCES