Abstract
We describe BaMORC, a software package that performs 13C chemical shift reference correction for either assigned or unassigned peak lists derived from protein NMR spectra. BaMORC provides an intuitive command line interface that allows non-NMR experts to detect and correct 13C chemical shift referencing errors in unassigned peak lists at the very beginning of NMR data analysis, further lowering the level of expertise required for effective protein NMR analysis. Furthermore, BaMORC provides an application programming interface for integration into sophisticated protein NMR data analysis pipelines, both before and after the protein resonance assignment step.
Keywords: Protein NMR, Chemical shift reference correction, Software package
Chemical shifts derived from protein NMR spectra have a wide variety of uses, including protein structure determination [1,2], characterization of ligand binding [3–5], and drug discovery and design [6,7]. However, deriving accurate chemical shift values requires referencing NMR spectra to a standard, typically an internal standard [8,9]. Due to human error and a variety of experimental factors [10,11], referencing errors occur quite frequently in 13C protein NMR data. An estimated 40% of the entries in the Biological Magnetic Resonance Bank (BMRB) have referencing issues [12]. The resulting referencing discrepancies are highly problematic because prior methods for reference correction required assignments and/or structure [13,14], which are the very downstream results that reference correction is meant to support. This leads to a co-dependency between reference correction and NMR structure determination, hampering the progress of many protein NMR analyses.
We therefore developed the Bayesian Model Optimized Reference Correction (BaMORC) method [15], which helps non-expert scientists detect and correct referencing errors in 13C Cα and Cβ chemical shifts at the beginning of the protein NMR analysis process, when chemical shifts are still unassigned. Here we describe the BaMORC method implemented in an easy-to-use software package written in the R programming language. BaMORC uses a Bayesian model to estimate amino acid frequencies from Cα and Cβ chemical shift statistics inferred from the Re-referenced Protein Chemical shift Database (RefDB) [12], with or without resonance assignment information. As shown in Figure 1, by minimizing the difference between the actual amino acid frequencies calculated from the known protein sequence and the frequencies estimated from the observed chemical shifts, BaMORC returns the reference correction value and the re-referenced chemical shift data. Figure 2 illustrates the required input and expected output generated by the BaMORC R package.
Figure 1:

Overview of the (unassigned) BaMORC algorithm.
Figure 2:

Input utilized and output generated by the BaMORC R package.
The BaMORC R package provides a command-line interface (CLI) for general use and an application programming interface (API) for users who are familiar with R programming, especially for use within an integrated development environment like RStudio [16]. As illustrated in Figure 2, the BaMORC R package accepts the protein sequence and chemical shifts in a variety of unassigned and assigned formats, including the NMR-STAR format utilized by the BMRB. The general row-based text format may be delimited by commas or whitespace, with the protein sequence on the first line followed by unassigned peaks or assigned Cα and Cβ chemical shift pairs on the following rows.
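As a rough illustration of the row-based layout described above, the following Python sketch parses the assigned variant of such a file (sequence on the first line, one Cα/Cβ pair per row). It is not part of BaMORC, whose own reader is the R function read_file; the helper name parse_shift_table and the sample values are hypothetical, and the sketch assumes every data row holds exactly two numeric columns.

```python
import re


def parse_shift_table(text):
    """Parse a row-based table: sequence on line 1, then one CA/CB pair per row.

    Hypothetical helper for illustration only. Rows may be delimited by
    commas, whitespace, or a mix of both.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty input")
    sequence = lines[0].upper()
    pairs = []
    for row in lines[1:]:
        # Split on any run of commas and/or whitespace, dropping empty fields.
        fields = [f for f in re.split(r"[,\s]+", row) if f]
        if len(fields) != 2:
            raise ValueError(f"expected 2 columns, got {len(fields)}: {row!r}")
        ca, cb = (float(f) for f in fields)
        pairs.append((ca, cb))
    return sequence, pairs


# Made-up sample input mixing comma and whitespace delimiters.
sample = """MKVLA
55.9, 33.1
56.4 32.8
62.3,32.6
"""
sequence, pairs = parse_shift_table(sample)
print(sequence, len(pairs))
```

Rejecting rows with an unexpected number of columns keeps a malformed file from silently producing a wrong correction later in the pipeline.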
Each input file is referred to as a “task” within a larger “job”. The BaMORC R package automatically interfaces with the registration, grouping and referencing algorithms to set up tasks, derives the optimal correction value for a given input, and returns the corrected chemical shifts in CSV format. The package can also accept a BMRB ID such as BMR4020 as input, in which case it retrieves the corresponding file from the BMRB web server, parses it, corrects the referencing, and returns the same set of outputs.
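Because a referencing error is a constant offset along the 13C axis, applying a correction amounts to shifting every Cα and Cβ value by the same amount. The Python sketch below shows that step and a CSV rendering of the result. It is an illustration, not BaMORC code: the function names, the CA/CB column headers, and the sign convention (corrected = observed + correction) are assumptions.

```python
import csv
import io


def apply_correction(pairs, correction_ppm):
    """Shift every CA/CB value by one offset.

    Assumed sign convention: corrected = observed + correction.
    """
    return [
        (round(ca + correction_ppm, 3), round(cb + correction_ppm, 3))
        for ca, cb in pairs
    ]


def to_csv(pairs):
    """Render CA/CB pairs as CSV text with a header row (hypothetical layout)."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["CA", "CB"])
    writer.writerows(pairs)
    return buf.getvalue()


# Made-up shifts that are mis-referenced by +2.0 ppm.
observed = [(55.0, 21.0), (60.5, 66.0)]
corrected = apply_correction(observed, -2.0)
print(to_csv(corrected))
```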
We have evaluated BaMORC against 568 13C protein NMR datasets from the RefDB with 90% or higher completeness with respect to Cα and Cβ chemical shift assignments. The resulting reference correction values should be close to 0 ppm, since each dataset in the RefDB has already been reference corrected using protein structure information. With chemical shift assignments, BaMORC provides reference correction values within +/− 0.50 ppm for all datasets and within +/− 0.22 ppm for 90% of the datasets, representing a 90% confidence interval (CI) of 0.40 ppm (Figure 3) [15]. This level of performance is superior to that of the prior state-of-the-art LACS method [14].
Figure 3:

Comparison of assigned BaMORC to the LACS method.
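The two summary figures quoted above, a symmetric bound that contains 90% of the datasets and a 90% CI width, need not differ by exactly a factor of two. One possible reading, which is an assumption here and not a definition taken from [15], is that the bound is the 90th percentile of the absolute correction values while the CI width is the distance between the 5th and 95th percentiles of the signed values. The Python sketch below computes both on made-up numbers with a nearest-rank percentile; for a distribution that is not centered on zero, the width comes out smaller than twice the bound.

```python
def nearest_rank(sorted_values, percent):
    """Nearest-rank percentile: smallest value with >= percent% of data at or below it."""
    n = len(sorted_values)
    rank = max(1, (percent * n + 99) // 100)  # integer ceiling of percent * n / 100
    return sorted_values[rank - 1]


def summarize(corrections):
    """Return (90% absolute bound, 5th percentile, 95th percentile, interval width)."""
    abs_bound = nearest_rank(sorted(abs(c) for c in corrections), 90)
    ordered = sorted(corrections)
    low = nearest_rank(ordered, 5)
    high = nearest_rank(ordered, 95)
    return abs_bound, low, high, high - low


# Made-up correction values in ppm (NOT results from the evaluation).
toy = [-0.12, -0.10, -0.08, -0.06, -0.05, -0.04, -0.03, -0.02, -0.01, 0.00,
       0.01, 0.02, 0.04, 0.06, 0.08, 0.10, 0.15, 0.20, 0.22, 0.40]
abs_bound, low, high, width = summarize(toy)
print(abs_bound, low, high, round(width, 2))
```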
In practice, however, 13C reference correction is most valuable before protein resonance assignments are known, and this is the situation the BaMORC package was designed to address. The unassigned BaMORC method has two major components: grouping and reference correction. Given an input peak list, the grouping algorithm returns a list of grouped Cα and Cβ peaks (spin systems), which serves as the input for the reference correction algorithm, as shown in Figure 2. The grouping algorithm is a variance-informed DBSCAN algorithm that employs derived dimension-specific match tolerance values to group peaks into spin systems. A peak list registration step is used to derive the necessary match tolerance values [17]. In addition to the grouped peaks, the reference correction component uses the JPred4 [18] server to generate sequence-based secondary structure predictions and then calculates the reference correction.
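The grouping idea can be sketched in a few lines of Python. This is a deliberately simplified stand-in, not the ssc implementation: it links two peaks when their 1H and 15N positions agree within fixed per-dimension tolerances and merges linked peaks transitively, which corresponds to DBSCAN in which every peak counts as a core point. The tolerance values and peak positions are made up, and the variance-informed derivation of tolerances from peak list registration is not reproduced.

```python
def group_peaks(peaks, tolerances):
    """Group (1H, 15N, 13C) peaks whose root dimensions agree within tolerances.

    `tolerances` holds one value per root dimension (1H, 15N); the 13C
    dimension is not compared. Linked peaks are merged with union-find.
    """
    parent = list(range(len(peaks)))

    def find(i):
        # Follow parent links to the group representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(peaks)):
        for j in range(i + 1, len(peaks)):
            close = all(
                abs(peaks[i][d] - peaks[j][d]) <= tol
                for d, tol in enumerate(tolerances)
            )
            if close:
                parent[find(j)] = find(i)

    groups = {}
    for i, peak in enumerate(peaks):
        groups.setdefault(find(i), []).append(peak)
    return list(groups.values())


# Made-up HN(CO)CACB-style peaks: (1H ppm, 15N ppm, 13C ppm).
peaks = [
    (8.250, 120.10, 56.2),
    (8.255, 120.15, 30.4),
    (7.900, 115.30, 62.1),
    (7.905, 115.25, 69.8),
]
spin_systems = group_peaks(peaks, tolerances=(0.02, 0.2))
for system in spin_systems:
    print(system)
```

Each resulting group pairs a Cα and a Cβ peak that share the same amide root, which is the form of input the reference correction step expects.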
Again we used the same 568 13C protein NMR datasets from the RefDB to evaluate the reference correction component of unassigned BaMORC, but without chemical shift assignments. As shown in Figure 4, the reference correction component of unassigned BaMORC provides reference correction values within +/− 0.45 ppm for 90% of the datasets, representing a 90% CI of 0.69 ppm [15]. This suggests that the unassigned BaMORC algorithm can achieve the same level of performance when handling unassigned 13C protein NMR peak list data. This level of real-world performance is demonstrated with a set of peak lists derived from solution NMR HN(CO)CACB spectra for 10 different proteins. In this real-world evaluation, unassigned BaMORC provided reference correction values all within +/− 0.40 ppm [15].
Figure 4:

Unassigned BaMORC reference correction accuracy.
Experimental
Software:
The Python programming language, version 3.6, is used for the grouping algorithm. The R programming language, version 3.4, is used for the BaMORC core component. The library dependencies are listed below:
Python Library Dependencies: Python (>=3.6), gcc (>=5.1)
R Library Dependencies: R (>=3.4), data.table, tidyr, DEoptim, httr, docopt, stringr, jsonlite, readr, devtools, RBMRB, BMRBr
Experimental data sources:
We used data from the RefDB to derive the chemical shift statistics used within the BaMORC package. For testing and evaluation, we used datasets from the RefDB and experimental peak lists from a variety of sources.
Installation:
To use the BaMORC package, users must first install the R 3.4.x (or higher) and Python 3.6.x (or higher) interpreters on their machine. For Linux distributions, this is typically accomplished through the distribution’s package management system. For other operating systems, installation may require a more manual procedure. R is a language and environment for statistical computing [19]. The installation guide is available on the website of the Comprehensive R Archive Network (CRAN) [https://cran.r-project.org/]. Python [20] can be installed from its website [https://www.python.org/].
Installing BaMORC from the command line (Linux and Mac only):
To use BaMORC, the user first needs to install the package from GitHub or CRAN.
$ wget -q https://cran.r-project.org/src/contrib/BaMORC_<version>.tar.gz
$ sudo R CMD INSTALL BaMORC_<version>.tar.gz
Install from the command line via the R console:
$ R # to start the R console
> install.packages("BaMORC")
Install from within the R console:
> install.packages("BaMORC")
Installing unassigned BaMORC dependencies:
The unassigned BaMORC analysis requires the ssc (Spin System Creator) package, which includes a variance-informed implementation of the DBSCAN algorithm used for protein NMR spin system clustering. The ssc package is distributed as a Docker container, so the user needs to install both Docker and the ssc Docker image.
Install Docker from https://www.docker.com/products/docker-desktop.
After Docker is installed, pull the ssc Docker image by running the following command in a terminal:
$ docker pull moseleybioinformaticslab/ssc
The BaMORC application programming interface (API):
After loading the BaMORC package in R, either in the R console or in RStudio, the user first reads in NMR chemical shift data via the read_file function, whose parameters are the file path, the file delimiter, and a flag that indicates whether the data are assigned or unassigned. BaMORC currently supports comma, semicolon and whitespace as file delimiters. Users who want to run an analysis on an existing dataset from the BMRB (NMR-STAR version 2 and 3) can use either the read_nmrstar_file function, with a parameter for a local file path, or the read_db_file function, with a parameter for the BMRB ID and a flag that indicates whether the data are assigned or unassigned. If read_db_file is used, BaMORC will utilize the BMRB web API to fetch the BMRB entry matching the ID. Table 1 shows common usage patterns for reading input data into the BaMORC reference correction analysis pipeline. For a full list of available conversion options and more detailed examples and documentation of all the functions, please refer to “The BaMORC Reference” and “Quickstart.”
Table 1:
Summary of BaMORC Package Interface (API).
| Command | Description | Example |
|---|---|---|
| read_file | Import local files | input_data = read_file(file_path = "./sample_input.txt", delim = "ws", assigned = T) |
| read_nmrstar_file | Import files in NMR-STAR format | input_data = read_nmrstar_file("BMR4020.str") |
| read_db_file | Use BMRB ID to import files | input_data = read_db_file(id = "BMR4020") |
| bamorc | Uses sequence, secondary structure and chemical shift data to estimate the reference correction value | bamorc(sequence, secondary_structure, chemical_shifts_input, from = -5, to = 5) |
| unassigned_bamorc | Uses only sequence and chemical shift data to estimate the reference correction value | unassigned_bamorc(sequence, chemical_shifts_input, from = -5, to = 5) |
Next, the user passes the input data as parameters to the bamorc() or unassigned_bamorc() function, which performs the reference correction analysis. Both functions utilize the output from the read-in functions mentioned above and will perform a secondary structure estimation based on the provided protein sequence if secondary structure information is not provided. Through a series of optimization calculations (for details, see [15]), bamorc() and unassigned_bamorc() return the estimated reference correction value in a plain text file and the corrected chemical shifts for both Cα and Cβ as a table, as shown in Figure 2. The user can optionally customize the search range. Table 1 contains a basic example of calling each function. For detailed examples and expected outputs of the BaMORC API functions, please refer to the online documentation: https://moseleybioinformaticslab.github.io/BaMORC/index.html.
The BaMORC command line interface (CLI):
The BaMORC CLI is an extension of the BaMORC package, aimed at the broader NMR community that is not familiar with the R programming language. To use the BaMORC CLI, the user first needs to locate the CLI run-script by opening a terminal and typing the command shown in Figure 5.
Figure 5:

Finding the CLI run-script location.
$ R -e 'system.file("exec", "bamorc.R", package = "BaMORC")'
The user can then execute the appropriate command listed in Table 2 to run an analysis. Similar to the package, the BaMORC CLI has three major modules: reference correction for assigned protein NMR data, reference correction for unassigned protein NMR data, and a miscellaneous collection of other useful tasks. Table 2 lists the components of the CLI and their associated parameters.
Table 2:
BaMORC CLI commands and their parameters.
| Command | Parameter | Example |
|---|---|---|
| Assigned | Input file path or ID | --table=sample_input.csv or --bmrb=bmr4020 or --id=BMR4020 |
| | Report file path | --report=sample_report.txt |
| Unassigned | Required parameter: | |
| | Input file path | --table=sample_input.csv |
| | Optional parameters: | |
| | Grouped peak list or not | --grouped=true |
| | Protein sequence | --seq=sample_sequence.txt |
| | Search range | --range=(-5,5) |
| | Output path | --output=sample_output.csv |
| | Report file path | --report=sample_report.txt |
| Help | Help menu | --h or -help |
| Version | Version number | --v or -version |
To help the user transition between the API and CLI, Table 3 illustrates common BaMORC CLI usage examples with corresponding BaMORC API examples. The CLI is used within a command line terminal on Linux and Mac computers. Windows users should refer to our online documentation for more details.
Table 3:
BaMORC CLI usage and corresponding API commands.
| CLI | API |
|---|---|
| Assigned BaMORC: For the user’s own protein NMR data | |
| $ bamorc.R assigned --table=./sample_input.csv --ppm_range=(-5,5) --output=./sample_output.csv --delimiter=comma --report=./sample_report.txt | > user_input = read_file(file_path = "./sample_input.csv", delim = "comma", assigned = T) > result = bamorc(sequence = user_input[[1]], chemical_shifts_input = user_input[[2]], from = -5, to = 5) |
| Assigned BaMORC: For data in NMR-STAR format | |
| $ bamorc.R assigned --bmrb=BMR4020.str --ppm_range=(-5,5) --output=./sample_output.csv --delimiter=comma --report=./sample_report.txt | > bmrb_format_data = read_nmrstar_file("BMR4020.str") > result = bamorc(sequence = bmrb_format_data[[1]], chemical_shifts_input = bmrb_format_data[[2]], from = -5, to = 5) |
| Assigned BaMORC: For data already existing in the BMRB database | |
| $ bamorc.R assigned --id=BMR4020 --ppm_range=(-5,5) --output=./sample_output.csv --delimiter=comma --report=./sample_report.txt | > existing_data = read_db_file(id = "BMR4020") > result = bamorc(sequence = existing_data[[1]], chemical_shifts_input = existing_data[[2]], from = -5, to = 5) |
| Unassigned BaMORC: For the user’s own protein NMR data | |
| $ bamorc.R unassigned --table=./sample_input.csv --ppm_range=(-5,5) --output=./sample_output.csv --delimiter=comma --report=./sample_report.txt | > user_input = read_file(file_path = "./sample_input.csv", delim = "comma") > result = unassigned_bamorc(sequence = user_input[[1]], chemical_shifts_input = user_input[[2]], from = -5, to = 5) |
| BaMORC CLI: other commands (CLI only) | |
| $ bamorc.R valid_ids | To show all the valid BMRB file IDs |
| $ bamorc.R -h | To show the help menu |
| $ bamorc.R -v | To show the BaMORC version |
Online documentation is available at https://moseleybioinformaticslab.github.io/BaMORC/index.html.
Acknowledgments
The authors acknowledge support from the National Science Foundation grant NSF 1252893 (Hunter N.B. Moseley) and National Institutes of Health grants NIH UL1TR001998-01 (Philip Kern) and NIH P30CA177558 (Mark Evers).
Footnotes
Reporting summary: Further information on the algorithms mentioned above and their development is available [15].
Code availability: Source code is available at https://github.com/MoseleyBioinformaticsLab/BaMORC. [The package has been submitted to CRAN and should be available from CRAN soon. We will add a sentence about its availability from CRAN and update installation instructions when the evaluation process is finished]. The code is published under a modified open source BSD-3 license. Academic researchers are free to use it without restriction, except for proper citation. This repository includes code for the BaMORC referencing correction pipeline. For the registration and grouping algorithm, please refer to https://github.com/MoseleyBioinformaticsLab/ssc [21]. For further information and assistance please visit our laboratory website: http://bioinformatics.cesb.uky.edu.
Data availability: Datasets are available at: https://doi.org/10.6084/m9.figshare.5270755.v1
References
- [1] Sattler M, Schleucher J, Griesinger C. (1999) Heteronuclear multidimensional NMR experiments for the structure determination of proteins in solution employing pulsed field gradients. Progress in Nuclear Magnetic Resonance Spectroscopy, 34, 93–158.
- [2] Shen Y, Lange O, Delaglio F, Rossi P, Aramini JM, Liu G, Eletsky A, Wu Y, Singarapu KK, Lemak A, Ignatchenko A. (2008) Consistent blind protein structure generation from NMR chemical shift data. Proceedings of the National Academy of Sciences of the United States of America, 105, 4685–4690.
- [3] Williamson MP. (2013) Using chemical shift perturbation to characterise ligand binding. Progress in Nuclear Magnetic Resonance Spectroscopy, 73, 1–16.
- [4] Jayalakshmi V, Krishna NR. (2004) CORCEMA refinement of the bound ligand conformation within the protein binding pocket in reversibly forming weak complexes using STD-NMR intensities. Journal of Magnetic Resonance, 168, 36–45.
- [5] Moseley HN, Curto EV, Krishna NR. (1995) Complete relaxation and conformational exchange matrix (CORCEMA) analysis of NOESY spectra of interacting systems; two-dimensional transferred NOESY. Journal of Magnetic Resonance. Series B, 108, 243–261.
- [6] Anderson AC. (2003) The process of structure-based drug design. Chemistry & Biology, 10, 787–797.
- [7] Shuker SB, Hajduk PJ, Meadows RP, Fesik SW. (1996) Discovering high-affinity ligands for proteins: SAR by NMR. Science, 274 (5292), 1531–1534.
- [8] Markley JL, Bax A, Arata Y, Hilbers CW, Kaptein R, Sykes BD, Wright PE, Wüthrich K. (1998) Recommendations for the presentation of NMR structures of proteins and nucleic acids–IUPAC-IUBMB-IUPAB Inter-Union Task Group on the standardization of data bases of protein and nucleic acid structures determined by NMR spectroscopy. Journal of Biomolecular NMR, 12, 1–23.
- [9] Wishart DS, Bigam CG, Yao J, Abildgaard F, Dyson HJ, Oldfield E, Markley JL, Sykes BD. (1995) 1H, 13C and 15N chemical shift referencing in biomolecular NMR. Journal of Biomolecular NMR, 6, 135–140.
- [10] Nowick JS, Khakshoor O, Hashemzadeh M, Brower JO. (2003) DSA: A new internal standard for NMR studies in aqueous solution. Organic Letters, 5, 3511–3513.
- [11] Ulrich EL, Akutsu H, Doreleijers JF, Harano Y, Ioannidis YE, Lin J, Livny M, Mading S, Maziuk D, Miller Z, Nakatani E. (2008) BioMagResBank. Nucleic Acids Research, 36 (Database issue), D402–408.
- [12] Zhang H, Neal S, Wishart D. (2003) RefDB: a database of uniformly referenced protein chemical shifts. Journal of Biomolecular NMR, 25, 173–195.
- [13] Han B, Liu Y, Ginzinger SW, Wishart DS. (2011) SHIFTX2: significantly improved protein chemical shift prediction. Journal of Biomolecular NMR, 50, 43–57.
- [14] Wang L, Eghbalnia HR, Bahrami A, Markley JL. (2005) Linear analysis of carbon-13 chemical shift differences and its application to the detection and correction of errors in referencing and spin system identifications. Journal of Biomolecular NMR, 32, 13–22.
- [15] Chen X, Smelter A, Moseley HNB. (2018) Automatic 13C chemical shift reference correction for unassigned protein NMR spectra. Journal of Biomolecular NMR, 72, 1–18.
- [16] RStudio Team. (2015) RStudio: Integrated Development for R. RStudio, Inc., Boston, MA.
- [17] Smelter A, Astra M, Moseley HNB. (2017) A fast and efficient python library for interfacing with the Biological Magnetic Resonance Data Bank. BMC Bioinformatics, 18 (1), 175–186.
- [18] Drozdetskiy A, Cole C, Procter J, Barton GJ. (2015) JPred4: a protein secondary structure prediction server. Nucleic Acids Research, 43 (W1), W389–W394.
- [19] R Core Team. (2018) R: A Language and Environment for Statistical Computing.
- [20] Van Rossum G, Drake FL. (2011) The Python Language Reference Manual. Network Theory Ltd.
- [21] Smelter A, Rouchka EC, Moseley HNB. (2017) Detecting and accounting for multiple sources of positional variance in peak list registration analysis and spin system grouping. Journal of Biomolecular NMR, 68, 281–296.
