PLOS Computational Biology. 2022 Oct 6;18(10):e1010258. doi: 10.1371/journal.pcbi.1010258

Linear discriminant analysis reveals hidden patterns in NMR chemical shifts of intrinsically disordered proteins

Javier A Romero 1,#, Paulina Putko 1,#, Mateusz Urbańczyk 2, Krzysztof Kazimierczuk 1,*, Anna Zawadzka-Kazimierczuk 3,*
Editor: Anna R Panchenko 4
PMCID: PMC9578625  PMID: 36201530

Abstract

NMR spectroscopy is key in the study of intrinsically disordered proteins (IDPs). Yet, even the first step in such an analysis—the assignment of observed resonances to particular nuclei—is often problematic due to low peak dispersion in the spectra of IDPs. We show that the assignment process can be aided by finding “hidden” chemical shift patterns specific to the amino acid residue types. We find such patterns in the training data from the Biological Magnetic Resonance Bank using linear discriminant analysis, and then use them to classify spin systems in an α-synuclein sample prepared by us. We describe two situations in which the procedure can greatly facilitate the analysis of NMR spectra. The first involves the mapping of spin systems chains onto the protein sequence, which is part of the assignment procedure—a prerequisite for any NMR-based protein analysis. In the second, the method supports assignment transfer between similar samples. We conducted experiments to demonstrate these cases, and both times the majority of spin systems could be unambiguously assigned to the correct residue types.

Author summary

Intrinsically disordered proteins dynamically change their conformation, which allows them to fulfil many biologically significant functions, mostly related to process regulation. Their link to many diseases of civilization makes them essential objects of study. Nuclear magnetic resonance spectroscopy (NMR) is one of the methods for such research, as it provides atomic-scale information on these proteins. However, the first step of the analysis, the assignment of experimentally measured NMR chemical shifts to particular atoms of the protein, is more complex than in the case of structured proteins. The methods routinely used for structured proteins are no longer sufficient. We have developed a method of resolving ambiguities occurring during the assignment process.

In a nutshell, we show that an advanced statistical method known as linear discriminant analysis makes it possible to determine chemical shift patterns specific to different types of amino acid residues. It allows assigning the chemical shifts more efficiently, opening the way to a plethora of structural and dynamical information on intrinsically disordered proteins.


This is a PLOS Computational Biology Methods paper.

Introduction

Intrinsically disordered proteins (IDPs) play an essential biological role in eukaryotes, being involved in differentiation, transcription regulation, spermatogenesis, mRNA processing and many other processes [1]. Research into IDPs is thus crucial, but it is also very challenging, as the high mobility of the polypeptide chain prevents crystallization, hampering the use of X-ray crystallography. This dynamic behavior by IDPs also prevents the use of cryogenic electron microscopy [2]. In light of this, nuclear magnetic resonance spectroscopy (NMR) remains the most appropriate method for atomic-level analysis, providing information on structure, dynamics and interactions with other molecules.

The most important observables in NMR are the resonance frequencies of nuclear magnetic moments placed in an external magnetic field. These frequencies, typically expressed in a chemical shift scale, depend on the local moieties of the nuclei. In particular, they are characteristic of the different amino acid residue types in a protein. In the case of folded proteins, the chemical shifts are further influenced by the secondary structure, but for IDPs the effect is far weaker [3]. Although IDPs do not behave in a purely random way and often form “compact states” [4], these states are only adopted transiently. For this reason the structure-induced chemical shift effects are averaged, resulting in spectra that can be very crowded and difficult to analyze.

The assignment of NMR signals to the nuclei of a protein is based on an analysis of a set of heteronuclear (1H, 13C, 15N) spectra that provide information about the sequential connectivities of chemical shifts. Such a set can include standard three-dimensional spectra such as HN(CA)CO or HNCA [5, 6], for example, but it can also use those of higher dimensionality (4D and more) to resolve signal overlap [7, 8]. The resonance assignment procedure comprises several steps (see Fig 1). The first step is peak picking, followed by the formation of spin systems. Each spin system contains information on the chemical shifts of nuclei interacting via scalar coupling (typically belonging to two adjacent residues, i and i-1, although some experiments can reach further [8]). To this end, peaks from different spectra that share certain resonance frequencies must be gathered. Next, by finding identical chemical shifts in different spin systems, sequential connectivities are established and spin-system chains are formed. Although the protein’s primary structure is a single, long linear chain of amino acid residues, the analysis of sequential connectivities in NMR spectra almost always leads to the formation of many shorter chains. The chains are interrupted when chemical shifts are found to be missing from the linking spectra, connectivities are ambiguous, or at proline positions (in the case of amide proton-detected experiments). In the particular case of IDPs, many very short chains appear due to poor peak separation.

Fig 1. The sequential assignment workflow: A) peak picking for all spectra; B) forming spin systems; C) finding sequential links (nuclei whose chemical shifts are used to find the connections between adjacent amino acid residues are marked with different colors); D) forming chains (the numbers of consecutive spin systems are given in boxes; the label “pre” denotes the residue preceding the formed chain for which some chemical shifts are known); E) amino acid recognition based on characteristic chemical shifts; F) chain mapping onto the protein sequence.

The final step is mapping the formed spin system chains onto the protein sequence. This mapping typically involves identifying characteristic amino acids by using the chemical shift statistics found in the Biological Magnetic Resonance Data Bank (BMRB), for example [9]. IDP-tailored statistics are also available [10], providing additional information about the influence of adjacent residues on the chemical shifts. The dependence of chemical shifts on neighbors further away (i-2, i+2), pH, ionic strength and temperature can also be exploited [11]. Typically, Cβ chemical shifts are used for mapping, with further assistance from Hβ, if available, or Cα. Certain amino acid residue types can be excluded based on the structure of the residue [12]. For example, the presence of Hβ or Cβ chemical shifts indicates a non-glycine residue. The presence of two different Hβ chemical shifts excludes alanine, isoleucine, threonine and valine, as these residues do not contain chemically inequivalent Hβ protons. In short, we can easily recognize alanine, glycine, serine and threonine, but more detailed analysis is needed for other amino acid residue types. Some of the above-mentioned recognition procedures are embedded in automatic assignment programs [13–16]. Clearly, the longer the chain in question, the easier the mapping step. In the case of shorter chains, which are very common in IDP analysis, it is essential to recognize the amino acids of as many residues as possible in order to map the chains accurately.

Fig 2 shows the distributions of the chemical shifts in a set of 17 IDPs from the BMRB. Clearly, in most cases, using just two chemical shifts for amino-acid recognition is insufficient: On two-dimensional planes, regions corresponding to different types of residues often overlap partially or even completely. Although some residue types can be clearly recognized—alanine, glycine, serine and threonine, for instance—recognizing others can be problematic. The Cβ-Hβ plane turns out to be the best choice for grouping residue types into spectral regions. Nonetheless, even here the regions are not fully separated. Adding more chemical shifts would improve the separation of regions, but also complicate visualization and manual analysis. So, to fully exploit the rich statistical information available, the recognition should be assisted with an algorithm that operates easily in a multidimensional space.

Fig 2. Chemical shifts of 17 unfolded proteins from the BMRB, marked with colors according to their amino acid residue types: A) Cα/CO plane; B) Cα/Cβ plane; C) Cα/Hα plane; D) Cβ/Hβ plane.

Although the chemical shifts are specific to the residue types, none of the 2D planes provides satisfactory separation allowing unambiguous assignment.

In this paper, we attempt to develop just such an algorithm. Below, we propose a statistical method based on Linear Discriminant Analysis (LDA) for the automatic recognition of amino acid residue types. It is worth mentioning that this method can be integrated into automatic assignment programs to facilitate the mapping step. Yet, it is not the only application for LDA, as shown in the results section. Of note, other classification methods were also tested (as shown in S1 Text and S1 Fig), but LDA obtained the highest performance scores among all of them. To the best of our knowledge, such a method has only been used once before in protein NMR, to detect beta-hairpin regions [17].

Materials and methods

Linear discriminant analysis for the classification of amino acid residue types

The mapping procedure described above can be regarded as a classification problem, as the aim is to assign amino acid residue types to different spin systems. This classification is based on the variables that define spin systems, namely the set of chemical shifts of nuclei belonging to the residue in question, measured in one or more experiments. LDA is a classification method well suited for this purpose.

LDA is related to the more popular method known as principal component analysis (PCA). Both methods look for linear combinations of variables that best explain variance in the data. But while PCA finds new coordinates that maximize the variance of the data, LDA maximizes the variance between the different classes (residue types) and at the same time minimizes the variance within each class [18]. Another important distinction is that in PCA the directions of maximal variance do not depend on the classes, so residue types are not taken into account. By contrast, LDA uses an already classified dataset (the training set) to explicitly attempt to model the difference between the classes. Once adequately trained, the model can then be used to classify the spin systems of an unassigned protein.

An LDA classification model comprises discriminant functions, each a linear combination of the predictive variables, chosen to provide the best separability between classes. These functions are derived from the training set, in which the classifications of the spin systems are known. The training set can be thought of as an N × (M + 1) table holding NMR data from N spin systems, each defined by M chemical shift values, with the last column giving the residue type. For the purpose of training the model, the LDA method assumes that the scatter of the chemical shift values of the spin systems belonging to each amino acid residue type can be accurately characterized by a normal distribution [19]. In this way, the spin systems from the k-th residue type are described by a mean vector μk and a covariance matrix Σk, given by:

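$$\mu_k = \frac{1}{n_k}\sum_{i=1}^{n_k} x_i, \qquad \Sigma_k = \frac{1}{n_k}\sum_{i=1}^{n_k} (x_i - \mu_k)(x_i - \mu_k)^{\mathsf{T}} \tag{1}$$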

where xi is a vector representing the i-th spin system, with dimensionality equal to the number of chemical shifts (M), and nk is the total number of spin systems of amino acid residue type k in the training set. To build the LDA classification model, we further assume that the covariance matrices for all classes are the same [18]. This common covariance matrix is usually called a pooled covariance matrix, and is given by:

$$\Sigma_{\mathrm{pooled}} = \frac{1}{N}\sum_{i} n_i \, \Sigma_i \tag{2}$$

where the summation runs through all the residue types. If we have a total of 20 amino acid types in the training set, the classification model consists of 20 discriminant functions {f1, f2, …, f20}. These functions define the regions of maximal probability for each residue type in the multidimensional chemical shift space, which may in particular correspond to a single spectrum. The boundaries between regions lie where two classes are equally probable; they form hyperplanes that separate the regions belonging to different classes. For new, unassigned spin systems we get probabilities corresponding to each of the amino-acid residue types. The classification score of a new spin system x for residue type k is given by [20]:

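$$f_k(x) = \mu_k^{\mathsf{T}} \Sigma_{\mathrm{pooled}}^{-1} x - \frac{1}{2}\mu_k^{\mathsf{T}} \Sigma_{\mathrm{pooled}}^{-1} \mu_k + \ln(\pi_k) \tag{3}$$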

where we define the a priori probability as πk = nk/N, and N is the total number of spin systems in the training set. Eq 3 is linear in the variable x, and hence the boundary between any two classes is a hyperplane. Within a single region bounded in this way, a given discriminant function has a higher classification score than all other functions, and all spin systems that fall inside this region are classed as belonging to the corresponding residue type.
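Eqs 1 to 3 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the code from the authors' repository: the function names (`fit_lda`, `lda_scores`, `lda_posteriors`) and the array layout are hypothetical. The posterior probabilities are obtained by normalizing exp(f_k) over the classes, which follows from the Gaussian, common-covariance assumptions described above.

```python
import numpy as np

def fit_lda(X, y):
    """Estimate class means, pooled covariance and priors (Eqs 1 and 2).

    X: (N, M) array of chemical shifts; y: length-N sequence of residue types.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    N, M = X.shape
    means, priors = {}, {}
    pooled = np.zeros((M, M))
    for k in sorted(set(y.tolist())):
        Xk = X[y == k]
        n_k = len(Xk)
        means[k] = Xk.mean(axis=0)              # mu_k, Eq 1
        centered = Xk - means[k]
        sigma_k = centered.T @ centered / n_k   # Sigma_k, Eq 1
        pooled += n_k * sigma_k / N             # Sigma_pooled, Eq 2
        priors[k] = n_k / N                     # pi_k = n_k / N
    return means, pooled, priors

def lda_scores(x, means, pooled, priors):
    """Evaluate the discriminant function f_k(x) of Eq 3 for every class."""
    x = np.asarray(x, dtype=float)
    inv = np.linalg.inv(pooled)
    return {
        k: float(mu @ inv @ x - 0.5 * mu @ inv @ mu + np.log(priors[k]))
        for k, mu in means.items()
    }

def lda_posteriors(scores):
    """Turn scores into class probabilities by normalizing exp(f_k)."""
    top = max(scores.values())                  # subtract max for stability
    weights = {k: np.exp(s - top) for k, s in scores.items()}
    total = sum(weights.values())
    return {k: float(w / total) for k, w in weights.items()}
```

A new spin system is assigned to the class with the highest score, for example `max(scores, key=scores.get)`. The explicit matrix inverse is used here only to mirror Eq 3; a linear solve would be the more robust choice for ill-conditioned covariance matrices.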

Importantly, this method assumes that the distribution of the chemical shifts in the mapped protein is similar to that of the training set. For resonance mapping in IDPs, this requirement can be met by choosing, for the training set, proteins described in the BMRB database as “unstructured”, “unfolded” or “disordered”. Our training set consisted of 1,613 spin systems from 17 such proteins (BMRB Entry IDs: 6436, 11526, 15176, 15179, 15180, 15201, 15225, 15430, 15883, 15884, 16296, 16445, 17290, 17483, 19258, 25118 and 30205). The training set must contain samples from all the amino acid residue types present in the IDP that is to be mapped. Of course, if certain residue types are missing from the protein under investigation, the spin systems corresponding to residues of this type should be removed from the training set, as should all spin systems that lack any of the chemical shifts that we choose to use for discrimination.
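The training-set restriction described above can be sketched as a simple filter. The record layout (a dictionary with a `residue` key and a `shifts` mapping) and the function name `restrict_training_set` are assumptions made for illustration only; they are not taken from the BMRB format or from the authors' code.

```python
def restrict_training_set(rows, shifts, residue_types):
    """Keep only usable training spin systems.

    rows: list of dicts such as {"residue": "A", "shifts": {"CA": 52.5, "CB": 19.1}}
          (hypothetical layout); a missing shift is absent or None.
    shifts: names of the chemical shifts chosen for discrimination.
    residue_types: residue types present in the protein to be mapped.
    Returns a list of (residue, [shift values in the order of `shifts`]).
    """
    kept = []
    for row in rows:
        if row["residue"] not in residue_types:
            continue  # residue type absent from the protein under study
        values = [row["shifts"].get(name) for name in shifts]
        if any(v is None for v in values):
            continue  # spin system lacks at least one required shift
        kept.append((row["residue"], values))
    return kept
```

The same filter, called with a shorter `shifts` list, gives one way to build a reduced training set for a spin system in which some chemical shifts are missing.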

Classification models that drop the common-covariance assumption behind Eq 2, and instead keep a separate covariance matrix for each class, are called Quadratic Discriminant Analysis (QDA) models [21]. Initially, one might assume that a QDA model would outperform its LDA sibling. However, in S1 Text and S1 Fig we show that LDA is actually the better choice for protein mapping. In the supporting information we also compare the performance of LDA to two other well-known classification methods: “k-nearest neighbors” and “support vector machines”. LDA scores highest in mean accuracy, sensitivity and specificity (although not by a large margin), while at the same time it shows the lowest variance of accuracy across all 17 training proteins.

To summarize our approach, although other methods can achieve a classification accuracy similar to that of LDA, we choose to use LDA because it demonstrates the lowest variance in classification accuracy after performing cross-validation with different proteins. LDA therefore allows for more consistent predictions across different IDPs.

Sample preparation

The 13C,15N-uniformly labeled α-synuclein was expressed as described by [22]. The sample concentration was 1.35 mM in 20 mM sodium phosphate buffer at pH 6.5. The buffer contained 200 mM NaCl, 0.5 mM EDTA, 0.02% NaN3 and a Protease Inhibitor Cocktail (Roche). 10% D2O was added for lock.

NMR spectroscopy

All experiments were performed at a temperature of 288.5 K on an Agilent 700 MHz spectrometer equipped with a 5 mm HCN room-temperature probe and DD2 console. The experiments (3D HNCO [23], 4D HNCACO [24], 4D HabCab(CO)NH [25, 26] and 4D (H)N(CA)CONH [26]) were acquired using non-uniform sampling. 3D data were processed using a multidimensional Fourier transform [27] and 4D data were processed using a sparse multidimensional Fourier transform [26], with HNCO as a basis spectrum. Spin systems were formed by gathering data from the peaks appearing on cross-sections corresponding to individual basis spectrum peaks. Where basis spectrum peaks overlapped, so that peaks from more than one spin system appeared on a given cross-section, the spin systems were discriminated according to the recommendations described in [13]: peaks were regarded as belonging to the cross-section on which they had the higher absolute intensity. Using 3D HNCO (instead of 15N-HSQC) as a basis spectrum made this approach more efficient; amide proton and nitrogen chemical shifts were determined more accurately. Thus, the overlapping cross-sections were more shifted than they would have been if a 2D basis spectrum had been used. The experimental parameters are shown side-by-side in Table 1. All spectra were displayed and analyzed using the Sparky program [28]. The experimental data (raw signals and Sparky spectral files) are available at zenodo.org (10.5281/zenodo.7032142).

Table 1. Experimental parameters: ni (number of hypercomplex increments), nuc (name of nucleus), t (maximum evolution time in ms) and sw (spectral width in Hz).

Experimental time is given in hours.

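Experiment | ni | Dim 1 nuc | Dim 1 t | Dim 1 sw | Dim 2 nuc | Dim 2 t | Dim 2 sw | Dim 3 nuc | Dim 3 t | Dim 3 sw | Exp. time
3D HNCO | 1600 | CO | 50 | 2800 | N | 75 | 2500 | - | - | - | 12
4D HNCACO | 2200 | CO | 50 | 2800 | Cα | 10 | 6200 | N | 75 | 2500 | 33
4D HabCab(CO)NH | 1450 | Hαβ | 10 | 7500 | Cαβ | 7.1 | 14000 | N | 75 | 2500 | 22
4D (H)N(CA)CONH | 2500 | N | 28 | 2500 | CO | 28 | 2800 | N | 75 | 2500 | 40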

Results and discussion

To use the proposed approach to solve the practical challenges of IDP resonance assignment, we first need to answer the following research questions: Is the training set of BMRB entries mentioned above consistent? Can it be used to classify unknown spin systems? What is the optimal set of chemical shifts that provides efficient discrimination? Can this method assist the chain-mapping step in the assignment procedure? And can it help in other resonance assignment problems, such as peak list transfer between the spectra of samples measured under slightly different conditions? We address these questions in this section.

All the BMRB entries that we used for training contain seven chemical shifts: HN, N, CO, Cα, Cβ, Hα and Hβ. Based on these chemical shifts, we constructed three training sets, corresponding to typical experimental setups (see Section Optimal set of chemical shifts): subset (i) HN, N, Cα and CO; subset (ii) Cα, Cβ, Hα and Hβ; and subset (iii), which was actually the complete set, containing all seven chemical shifts. Some spin systems in the testing set may be incomplete, that is to say, lacking certain chemical shifts; this is usually the case for the residue preceding the formed chain of spin systems. To classify such spin systems we used a separate, restricted training set consisting of those chemical shifts that are present.

Consistency of the training data

We evaluated the consistency of the training dataset (17 proteins from the BMRB) by performing a leave-one-out cross-validation using subset (iii): we trained the classification model using the NMR data from 16 proteins and tested it on the remaining protein, then repeated the process, swapping the test protein until all proteins had been used for testing. Fig 3 shows the accuracy of amino acid type recognition by LDA, defined as the number of correct classifications over the total number of classifications. We found the mean accuracy, weighted by the number of residues in the proteins, to be 89.43%. It is worth noting that this level of accuracy is high, despite the fact that several of the proteins contained numerous residues lacking one or more chemical shift values.
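The leave-one-protein-out procedure can be sketched generically as follows. The function name `leave_one_protein_out` and the `train` / `predict` callables are placeholders for any classifier (LDA in this paper); the data layout is an assumption. Weighting each protein's accuracy by its number of residues is equivalent to dividing the total number of correct classifications by the total number of classified spin systems, which is what the sketch computes.

```python
def leave_one_protein_out(data, train, predict):
    """Leave-one-protein-out cross-validation.

    data: dict mapping a protein identifier to a list of (features, label).
    train: callable taking a list of (features, label), returning a model.
    predict: callable taking (model, features), returning a label.
    Returns (per-protein accuracy dict, residue-weighted mean accuracy).
    """
    counts = {}
    for held_out, test_systems in data.items():
        # Train on every protein except the held-out one.
        training = [
            system
            for protein, systems in data.items()
            if protein != held_out
            for system in systems
        ]
        model = train(training)
        correct = sum(predict(model, x) == label for x, label in test_systems)
        counts[held_out] = (correct, len(test_systems))

    per_protein = {p: c / n for p, (c, n) in counts.items()}
    total_correct = sum(c for c, _ in counts.values())
    total = sum(n for _, n in counts.values())
    return per_protein, total_correct / total
```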

Fig 3. Classification accuracy for LDA using HN, N, CO, Cα, Cβ, Hα and Hβ chemical shifts.

We performed leave-one-out cross-validation with NMR data from the 17 proteins, downloaded from the BMRB. Amino acid distributions are shown as percentages of each type present in each protein.

This high level of accuracy leads us to the conclusion that the selected BMRB entries do indeed show similar chemical shift distributions for the same residue types. It is therefore not unreasonable to suggest that the chemical shifts of a new, as yet unassigned IDP will be correctly classified using this training data. It is worth mentioning that chemical shift values of training and test proteins are normalized simultaneously using Pareto scaling to correct for possible referencing errors.
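Pareto scaling centers each variable on its mean and divides by the square root of its standard deviation. A minimal NumPy sketch is given below. Reading "normalized simultaneously" as computing the statistics on the stacked training and test data is an assumption of this sketch, as are the function name `pareto_scale` and the use of the sample standard deviation (`ddof=1`).

```python
import numpy as np

def pareto_scale(train, test):
    """Pareto-scale training and test shifts with shared column statistics.

    train, test: 2D arrays with one row per spin system and one column per
    chemical shift. Each column is mean-centered and divided by the square
    root of its standard deviation, both computed on the stacked data.
    """
    train = np.asarray(train, dtype=float)
    test = np.asarray(test, dtype=float)
    stacked = np.vstack([train, test])
    mean = stacked.mean(axis=0)
    std = stacked.std(axis=0, ddof=1)        # sample standard deviation
    scaled = (stacked - mean) / np.sqrt(std)
    return scaled[: len(train)], scaled[len(train):]
```

Compared with full standardization (division by the standard deviation itself), this scaling reduces, but does not remove, the differences in spread between columns.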

Optimal set of chemical shifts

Some chemical shifts are more important than others in the classification process, that is to say, they have greater predictive power. We measured the importance of the chemical shifts involved to evaluate their impact on the accuracy of the model (Fig 4). To do this, we first trained the model with the complete N × (M + 1) table, representing the entire training set. Next, we randomly shuffled a single one of the M chemical shift columns of the training set. Finally, we used the model that we had previously trained to classify the set with one shuffled column. Randomly shuffling one column removes the correlation between that chemical shift and the amino acid type for each spin system, while preserving the descriptive statistical information in the column. By measuring the error rate in the classification (defined as 1 minus the accuracy), we can evaluate which chemical shifts are more important for making accurate classifications.
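The shuffling test can be sketched as follows. The function name `shuffled_column_error` and the `predict` callable (standing in for an already trained model) are hypothetical; the default of 1,000 repeats matches the number of tests reported for Fig 4, and the standard deviation is returned so that error bars can be drawn.

```python
import random

def shuffled_column_error(rows, labels, column, predict, n_repeats=1000, seed=0):
    """Mean and standard deviation of the error rate after shuffling one column.

    rows: list of lists of chemical shift values (one inner list per spin system).
    labels: true residue types, in the same order as rows.
    column: index of the chemical shift to shuffle.
    predict: callable mapping one row to a predicted residue type.
    """
    rng = random.Random(seed)
    errors = []
    for _ in range(n_repeats):
        values = [row[column] for row in rows]
        rng.shuffle(values)  # break the link between this shift and the label
        shuffled = [
            row[:column] + [value] + row[column + 1:]
            for row, value in zip(rows, values)
        ]
        wrong = sum(predict(r) != lab for r, lab in zip(shuffled, labels))
        errors.append(wrong / len(rows))  # error rate = 1 - accuracy
    mean = sum(errors) / len(errors)
    variance = sum((e - mean) ** 2 for e in errors) / len(errors)
    return mean, variance ** 0.5
```

A chemical shift whose shuffling leaves the error rate unchanged carries little information for the trained model; a large increase marks an important one.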

Fig 4. Importance of different chemical shifts for the classification models, measured as the error rate of LDA models when making classifications on a test set with a shuffled column corresponding to a single chemical shift.

The results shown are the mean values from performing 1,000 tests on each chemical shift, with error bars representing one standard deviation.

The conclusions in Fig 4 form the basis of the model that we present here. It is well known from statistics [29] that Cβ and Hβ, followed by Cα, are the most characteristic chemical shifts. Yet, the data in Fig 4 shows that the aggregated effect of the other chemical shifts is not negligible, resulting in a complex seven-dimensional pattern that is impossible to analyze manually.

When choosing the optimal chemical shift set, we have to take into account the limitations of available NMR experiments. These experiments differ in their dimensionality, sensitivity and established correlations. What we want is a set of experiments providing information about the desired chemical shifts that can be obtained in a minimum amount of experimental time. Our goal is to obtain sets of chemical shifts that characterize different residues. Typically, however, the triple-resonance experiment providing intra-residual correlations also gives additional inter-residual peaks. For example, 4D HNCACO provides the desired correlation HiN-Ni-Cαi-COi (subset (i)) but it also provides HiN-Ni-Cαi−1-COi−1. For our purpose, the two types of correlations need to be distinguished by means of an extra experiment, namely 4D HNCOCA, that contains only the latter, inter-residual types of peaks. On the other hand, we can obtain the chemical shifts for subset (ii) with just a single experiment, namely 4D HabCab(CO)NH. In this case, the four chemical shifts (Cα, Cβ, Hα and Hβ) correspond to the residue preceding the correlated amide group. The experimental setups providing the chemical shifts of the aforementioned subsets (i), (ii) and (iii) are presented side-by-side in Table 2.

Table 2. Proposed sets of experiments needed to obtain the described chemical shift subsets (i)-(iii).

In subset (iii), 3D HNCO is required for SMFT processing [26].

Subset | Chemical shifts | Experiments
(i) | HN, N, Cα, CO | 4D HNCACO; 4D HNCOCA
(ii) | Cα, Cβ, Hα, Hβ | 4D HabCab(CO)NH
(iii) | HN, N, CO, Cα, Cβ, Hα, Hβ | 3D HNCO; 4D HabCab(CO)NH; 4D (H)N(CA)CONH

We also performed the leave-one-out analysis described in Section Consistency of the training data to evaluate how well the method performed for subsets (i), (ii) and (iii). Fig 5 shows the confusion charts (average LDA probability matrix) for different amino acid types. Even for subset (i), the majority of the classifications were correct. However, the results for subset (ii) were much better, that is to say, less ambiguous. The results for subset (iii) are only slightly better than those for subset (ii), which, unlike subset (iii), can be obtained from a single experiment. We may therefore conclude that subset (ii) is the optimal choice for LDA analysis.

Fig 5. Confusion charts for the proposed chemical shift sets.

The charts result from performing leave-one-out cross-validation of the 17 proteins from the BMRB used for training. Diagonal elements (in blue) represent correct classifications, and non-diagonal elements (in red) represent incorrect classifications. The values listed to the right of each chart are the sum, for all 17 proteins used, of residues of a given type that were classified. The charts are “row normalized”, that is to say, each row shows the distribution of how true amino acid types were classified. Weighted mean accuracy (weighted by the number of residues in the proteins) was calculated for each case: A) subset (i), 66.94%; B) subset (ii), 88.67%; and C) subset (iii), 89.51%.

Classifying unknown spin systems

We tested the proposed approach on experimental data from α-synuclein. Fig 6 shows the peak positions in a Cβ/Hβ spectral projection, with the chemical shift distribution for the training data set from the 17 BMRB entries superimposed on top of it. Although Fig 4 shows that Cβ and Hβ are the most discriminating, they clearly provide insufficient resolution for most of the groups—only alanine, leucine and isoleucine can be clearly distinguished, while the remaining spin systems are impossible to identify based on Cβ and Hβ alone. This provides additional motivation for using the more advanced LDA approach.

Fig 6. Cβ/Hβ plane showing the chemical shifts of 17 unfolded proteins from the BMRB (shown in different colors) and 2D projections of higher-dimensional spectra of α-synuclein (shown in black).

Fig 7 shows the results of using LDA on the α-synuclein data. To make the comparison clear, we used only residues for which all seven chemical shifts were known, allowing us to construct subset (iii). The only exceptions were glycine and proline residues, which naturally lack chemical shifts of missing nuclei (Cβ and Hβ for glycine, HN for proline). These residues were included if all other chemical shifts (5 or 6, respectively) were present. In the figure we did not take into account several residues with resonances missing for any other reason. However, as emphasized previously and shown later in Sections Application 1: Mapping spin-system chains and Application 2: Resonance list transfer, it is generally possible to perform classification of spin systems with missing chemical shifts.

Fig 7. Results of LDA of the chemical shifts of α-synuclein: A) subset (i) (HN, N, CO and Cα); B) subset (ii) (Cα, Cβ, Hα and Hβ); and C) subset (iii) (HN, N, CO, Cα, Cβ, Hα and Hβ).

The vertical axis shows the spin system numbers, the horizontal axis the amino acid types (LDA classes). Spin system numbers are colored according to their true amino acid type. Marker sizes indicate the probability, according to LDA, that the spin system in question belongs to a given class.

The trends observable in Fig 7 are in line with the predictions given in Fig 4. LDA of subset (i) enables unambiguous recognition of some amino acid types (alanine, serine, threonine and glycine), but it is ambiguous for others. In particular, some groups of amino acid residue types can be confused: a) asparagine and aspartic acid; b) glutamine, glutamic acid and lysine; c) phenylalanine and tyrosine; and d) isoleucine and valine. For these residue types, the probabilities are similar for each of the amino acid types within the group, although, typically, the highest probability corresponds to the correct type. There are also residues for which the correct amino acid type was not recognized at all using the chemical shifts of data subset (i), namely one methionine, one leucine and one histidine. The ambiguity is reduced when using the chemical shifts of subsets (ii) and (iii), in which case, for asparagine, aspartic acid, histidine, isoleucine, leucine and valine, the probability of identifying the correct amino acid type is almost 100%. For lysine, methionine and tyrosine, the probability exceeds 70%. The only ambiguities that remain are glutamine and glutamic acid (although the probability of identifying the correct type is now higher) and phenylalanine (which can still be confused with tyrosine).

We may conclude that subset (ii) provides results that are almost as good as those for subset (iii), and that both subsets allow much more effective LDA than subset (i). As subset (ii) can be obtained from a single experiment, we recommend using this subset as the approach of choice. The Python code used for LDA analysis, along with the NMR data of the α-synuclein IDP and all input files needed to reproduce the results shown in Fig 7, is available in a public GitHub repository [30].

Application 1: Mapping spin-system chains

The assignment of backbone resonances in a protein is a two-step process: First, we form spin system chains, then we map them onto the known amino acid sequence. The latter step can be greatly enhanced by LDA. LDA identifies residues in a chain more efficiently than traditional “manual” recognition, which typically finds only glycines, alanines and serines/threonines. Optionally, the LDA analysis can be followed by filtering the results against the amino acid sequence of the protein under consideration. “Impossible” chains can be filtered out automatically using the output of the LDA analysis. First, for each chain a number of amino acid sequences are formed, which arise from all the combinations of amino acid types that LDA predicts as probable for each spin system in the chain. Then, combinations that are not present in the sequence of the test protein are discarded. This procedure is included in the code provided in the GitHub repository as an optional feature: if a file containing the spin system chains is given as input, the code produces an extra spreadsheet detailing all possible amino acid sequences for each chain, their probabilities and the discarded combinations. In the examples below we used this option.
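The sequence-filtering step can be sketched as follows. This is an illustration, not the repository code: the function name `filter_chain_candidates`, the `min_prob` cut-off that decides which types count as "probable", and the scoring of a combination by the product of the per-residue LDA probabilities (which treats the spin systems as independent) are all assumptions.

```python
from itertools import product

def filter_chain_candidates(candidates, sequence, min_prob=0.05):
    """Enumerate residue-type combinations for a chain and filter by sequence.

    candidates: one dict per spin system in the chain, mapping a one-letter
                residue code to its LDA probability.
    sequence: amino acid sequence of the protein (one-letter codes).
    min_prob: hypothetical cut-off below which a residue type is ignored.
    Returns (kept, discarded), each a list of (fragment, probability);
    kept is sorted from most to least probable.
    """
    options = [
        [(aa, p) for aa, p in per_system.items() if p >= min_prob]
        for per_system in candidates
    ]
    kept, discarded = [], []
    for combo in product(*options):
        fragment = "".join(aa for aa, _ in combo)
        probability = 1.0
        for _, p in combo:
            probability *= p
        # A fragment is "possible" only if it occurs in the protein sequence.
        target = kept if fragment in sequence else discarded
        target.append((fragment, probability))
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept, discarded
```

The number of combinations grows multiplicatively with chain length, so a cut-off such as `min_prob` keeps the enumeration small for longer chains.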

Fig 8 shows chain mapping cases of increasing difficulty. The easiest task is to map relatively long chains with several easily recognizable residues. The chain of seven residues shown in Fig 8A) contains two glycines and can be mapped manually without ambiguity. Nevertheless, LDA provides even more reliable mapping, as it recognizes all seven residue types with > 90% probability.

Fig 8. Comparison of the chain mapping procedure for α-synuclein with and without LDA analysis.

LDA was performed using subset (iii), but in some residues, certain chemical shifts were missing and a reduced dataset was used. The numbers of the spin systems forming the chains are shown in square boxes. The label “pre” stands for the amino acid residue preceding the formed chain. The arrows labeled “LDA” point to the results of the LDA analysis, and the size of the one-letter codes corresponds to the probability as determined by LDA. The arrows labeled “aa-seq filtering” point at the results after amino-acid sequence filtering, reducing the number of possibilities. The arrows pointing left point to unambiguous manual identification of glycine, alanine, serine and threonine (“X” is used for all other amino acid types). All chains for which identification is consistent are shown on both sides, and the correct chain is marked in green. Panels A)-E) correspond to chain-mapping tasks of increasing difficulty (see Section Application 1: Mapping spin-system chains for details).

Fig 8B) shows the more difficult case of a shorter chain. It is possible to manually identify one of the three residues as serine or threonine. However, this is not sufficient for unambiguous mapping. In contrast, LDA followed by amino acid sequence filtering provides a precise result.

Sometimes, as in Fig 8C), the probability corresponding to the correct amino acid type is not the highest one. Fortunately, we can make the correct choice based on a protein sequence that lets us rule out “impossible” chains, even if the LDA implies that they are the most probable ones. Notably, unambiguous manual mapping of the chain in Fig 8C) is not possible, as only one characteristic residue (glycine) is present.

Often, short chains do not contain even a single easily recognizable residue. Fig 8D) gives an example of such a chain, one that is practically impossible to map manually. LDA makes mapping possible, but we need to consider various combinations of residues. In the case in Fig 8D), only one combination—that with the second-highest LDA probability for two of the three residues—corresponds to the fragment of the protein sequence allowing unambiguous mapping.

Finally, in rare cases, LDA may produce the wrong result for specific residues. An example is shown in Fig 8E). The “preceding” residue (6K) is wrongly identified by LDA as glutamic acid or glutamine; in fact, it is lysine. It cannot be glutamic acid or glutamine, as this would lead to “impossible” chains (EGLS or QGLS) that do not exist in the protein sequence. In such cases, correct mapping requires manual intervention or the use of a more advanced assignment algorithm. Notably, in this example it is again impossible to map the chain manually without ambiguity, as with only one glycine, three mappings are possible.

Application 2: Resonance list transfer

Another example of where LDA can be used is the common task of transferring a resonance list from the repository (for example, the BMRB) to the experimental spectrum of a new sample. Usually, the experimental conditions (temperature, pH, ionic strength, concentration, and so on) are not exactly the same as those reported in the repository, and peaks may be shifted in a non-systematic manner. To illustrate this problem, we investigated one of the regions of the 15N HSQC spectrum of α-synuclein, containing a variety of amino acid residue types (see Fig 9A).

Fig 9. Application of LDA to the resonance list transfer from BMRB Entry 6968 to the experimental spectrum of α-synuclein.

A) A fragment of the 15N-HSQC spectrum with experimental peaks marked with dots. Peaks from the BMRB list are marked with crosses. Labels indicate the corresponding residue name. B) Correct transfer of assignment from BMRB to the experimental peak list (unambiguous assignments are shown with solid arrows, more ambiguous ones with dashed arrows). The experimental and BMRB peaks corresponding to the successors of residues of the same amino acid type are marked in the same color. C) Hα, Hβ, Cα and Cβ chemical shifts of the residues preceding the residues shown in panels A) and B). For experimental peaks, these chemical shifts were obtained from the 4D HabCab(CO)NH experiment and then analyzed with LDA, enabling assignment transfer. Color coding as in panel B). The full spectrum with both peak lists is available at 10.5281/zenodo.7032142.

We picked the 16 resonances in the region of interest (indicated by dots in Fig 9A). We also loaded the peak list from the BMRB entry 6968 (indicated by crosses). It appears that almost all peaks from the BMRB list deviated from the peaks generated in the experiment, resulting in ambiguity during transfer of assignment.

For this reason, we decided to facilitate the transfer by means of LDA. Using 1H and 15N experimental peak positions, we peak picked the 4D HabCab(CO)NH spectrum, obtaining the Hα, Hβ, Cα and Cβ chemical shifts of the preceding residues. We then performed LDA on them (Fig 9C).

In many cases, where only one resonance corresponding to the given amino acid type occurred in the proximity of the peak under consideration, LDA recognition allowed for unambiguous transfer of assignment (solid arrows in Fig 9B and 9C). However, where several peaks corresponding to the same type occurred close to the peak, ambiguity remained (assignment shown with dashed arrows). For instance, in the region considered in Fig 9A there were four alanine peaks. The patterns of the experimental and BMRB peaks corresponding to this amino acid type (indicated by blue dots and crosses in Fig 9B) were very similar, and indeed, in this case, choosing the nearest option was correct. However, care is called for in such situations, as some deviations may be significant; in the absence of additional information (for example, about the sequential connectivities) the mapping may be incorrect.

An interesting case is the BMRB peak corresponding to Y39-V40. Two experimental peaks occur in its vicinity, both with significant probabilities of being a tyrosine residue at i − 1 position (43% in the case of the closest experimental peak, 89% in the case of the second-closest peak). However, another BMRB peak also occurs close by, corresponding to F94-V95. As the probabilities of phenylalanine for the experimental peaks in question were 56% and 11%, the F94-V95 was assigned to its closest peak and Y39-V40 to its second-closest peak. Thus, the probabilities represent valuable information in such cases.
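
The reasoning in this example can be written out as a short worked calculation. The sketch below is only an illustration, not the authors' procedure: scoring each candidate transfer by the product of the LDA probabilities is an assumed scoring rule, and the sketch further assumes that the two experimental peaks are the same two candidates for both BMRB peaks, with "closest" and "second" defined relative to the Y39-V40 BMRB peak. The probabilities are those quoted in the text.

```python
from itertools import permutations

# LDA probabilities (from the text) for the type of the residue at position i-1,
# for the two experimental peaks near the Y39-V40 BMRB peak.
p_type = {
    "closest": {"Y": 0.43, "F": 0.56},
    "second": {"Y": 0.89, "F": 0.11},
}
# BMRB peaks to transfer, with the type of their preceding residue.
bmrb = {"Y39-V40": "Y", "F94-V95": "F"}


def rank_transfers(p_type, bmrb):
    """Score every one-to-one transfer by the product of LDA probabilities
    (an assumed, illustrative scoring rule) and sort from best to worst."""
    labels = list(bmrb)
    scored = []
    for peaks in permutations(p_type, len(labels)):
        score = 1.0
        for label, peak in zip(labels, peaks):
            score *= p_type[peak][bmrb[label]]
        scored.append((score, dict(zip(labels, peaks))))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored


ranked = rank_transfers(p_type, bmrb)
for score, transfer in ranked:
    print(round(score, 4), transfer)
```

Under these assumptions, assigning Y39-V40 to the second-closest peak and F94-V95 to the other peak scores 0.89 × 0.56 ≈ 0.50, whereas the nearest-peak transfer for Y39-V40 scores only 0.43 × 0.11 ≈ 0.05, in line with the choice made in the text.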

To sum up, using LDA can greatly facilitate assignment transfer, often preventing incorrect transfer to the closest neighboring peak. However, even when using LDA, care is called for and different possibilities should be considered.

Practical recommendations

Our experiments show that LDA is a versatile tool that can support spectral analysis in many ways. Users will no doubt develop their own habits and practices when it comes to employing the tool. Below, we summarize our own practical recommendations, based on our experience.

  • As in all machine learning methods (MLMs), the best training set for LDA will have similar features to those of the test data. For the chemical shifts of proteins, that means that all the molecules in the training set and the test data should lack secondary structure. This condition is not strict, however: as shown in this paper, quite impressive results are possible even with a very coarse selection of proteins for the training data.

    In fact, the training data contained small regions with residual structure: an α-helical region (up to 50% populated) for residues 60–78 in BMRB data set 11526, a similarly long (19 residues) α-helical linker in data set 15176, and several transient helices in data set 15179 (one of them with secondary shifts of Cα of up to 4 ppm, suggesting complete formation of the helix).

  • As many training spin systems as possible should be used (again, this is true for all MLMs). Interestingly, for very large proteins it might be possible to use data from the same molecule for both training and testing. In other words, once the majority (for example, 80%) of the residues have been assigned using traditional methods, they can serve as the training set, and LDA can then be used to complete the assignment.

  • 4D HabCab(CO)NH is the experiment that provides the best data for LDA. For optimal resolution, non-uniform sampling (NUS) must be used for signal acquisition, and a variety of processing methods is available. As we had a 3D HNCO at our disposal, we were able to process the data using a sparse multidimensional Fourier transform [26, 31], but numerous other options are possible, including compressed sensing [32, 33], maximum entropy [34], variants of the CLEAN algorithm [35, 36], projection spectroscopy [37], and many others [38–41]. The proper separation of NH resonances is particularly important, since it affects both the determination of the most differentiating chemical shifts (Cβ, Hβ) and the formation of spin systems. Besides resolution enhancement by NUS, one may address the problem by increasing the experiment dimensionality: acquiring a 5D HabCabCONH would allow the spin systems to be separated using triples of frequencies (HiN, Ni and COi−1), which significantly reduces the overlap problem [25, 26]. Another approach is to apply methods based on 13C detection [42]. On the other hand, the ambiguous cases that result from peak overlap can be easily detected and excluded from consideration. Thus, the situation is not as “dangerous” as, for example, in sequential assignment.

  • The results of the LDA should always be combined with other available information. For example, incorrect classifications can often be detected by comparing the LDA results for the spin-system chains with the protein’s primary structure. “Impossible” chains should not be considered, even if the amino acids that compose them have the highest probability according to LDA. Automatic filtering of impossible chains is incorporated as an option in our code.
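
The recommendations above rely on a classifier that is trained on assigned spin systems and returns a probability for every residue type. The following is a minimal sketch of such a classifier, assuming the textbook form of LDA (class means, one pooled covariance matrix, class priors). It is not the code from the GitHub repository; the function names are hypothetical, only two features (Cα, Cβ) are used instead of the full set of chemical shifts, and the training values are made-up toy numbers rather than BMRB data.

```python
import math


def fit_lda(samples):
    """Fit a two-feature LDA model.

    samples: dict mapping class label -> list of (x, y) feature pairs.
    """
    n_total = sum(len(pts) for pts in samples.values())
    means, priors = {}, {}
    sxx = sxy = syy = 0.0
    for label, pts in samples.items():
        mx = sum(p[0] for p in pts) / len(pts)
        my = sum(p[1] for p in pts) / len(pts)
        means[label] = (mx, my)
        priors[label] = len(pts) / n_total
        # Accumulate within-class scatter for the pooled covariance matrix.
        for x, y in pts:
            sxx += (x - mx) ** 2
            sxy += (x - mx) * (y - my)
            syy += (y - my) ** 2
    dof = n_total - len(samples)
    sxx, sxy, syy = sxx / dof, sxy / dof, syy / dof
    det = sxx * syy - sxy ** 2
    inverse = ((syy / det, -sxy / det), (-sxy / det, sxx / det))
    return {"means": means, "priors": priors, "inverse": inverse}


def posteriors(model, point):
    """Return the posterior probability of each class for one spin system."""
    (a, b), (c, d) = model["inverse"]
    x, y = point
    scores = {}
    for label, (mx, my) in model["means"].items():
        wx, wy = a * mx + b * my, c * mx + d * my  # inverse covariance times mean
        scores[label] = (
            x * wx + y * wy
            - 0.5 * (mx * wx + my * wy)
            + math.log(model["priors"][label])
        )
    top = max(scores.values())  # subtract the maximum for numerical stability
    weights = {k: math.exp(v - top) for k, v in scores.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


# Toy training data: (Calpha, Cbeta) in ppm. Made-up values, for illustration only.
training = {
    "A": [(52.4, 19.1), (52.7, 19.0), (52.3, 19.3), (52.9, 18.8)],
    "S": [(58.2, 63.9), (58.5, 63.7), (58.1, 64.1), (58.6, 63.6)],
    "T": [(61.7, 69.9), (62.0, 69.6), (61.6, 70.1), (62.1, 69.7)],
}
model = fit_lda(training)
post = posteriors(model, (58.3, 63.8))
print(max(post, key=post.get), post)
```

Returning the full set of posterior probabilities, rather than only the top class, is what makes it possible to combine the classification with other information such as the primary structure.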

Conclusion

In this paper, we show that linear discriminant analysis (LDA) is a reliable classification method supporting the assignment of resonances in NMR spectra. The method can help with many tasks, such as chain mapping and assignment transfer. In our experience, it is easy to obtain a training set from the BMRB with only coarse filtering—that is to say, using proteins marked as “unfolded”, “unstructured” or “disordered”. We believe that, with a growing number of IDP resonance assignments in the databases, this method will become even more powerful and reliable in the future.

Supporting information

S1 Text. LDA vs. other classification methods.

We analyzed the performance of LDA, Quadratic Discriminant Analysis (QDA), K-Nearest Neighbours (KNN) and Support Vector Machines (SVM) to pick the best classification method for our research. We looked at specific performance parameters such as accuracy, sensitivity, specificity and consistency.

(PDF)

S2 Text. LDA classification of proteins in the training set.

The results of LDA for the proteins from the training data, obtained in the same way as in Fig 7 (subset (iii)).

(PDF)

S3 Text. Comparison of LDA performance with TSAR amino-acid recognition procedure.

The performance of the LDA approach was compared with the amino-acid recognition procedure of the TSAR program [13].

(PDF)

S1 Fig. LDA classification of proteins in the training set.

(PDF)

S2 Fig. Sensitivity and specificity of the methods.

Comparison of performance in classification for all 4 methods.

(PDF)

S3 Fig. LDA classification of proteins in the training set.

The results for all 17 proteins from the BMRB that compose the training set are shown, demonstrating the efficiency and accuracy of the LDA approach.

(PDF)

S4 Fig. Comparison of LDA performance with TSAR amino-acid recognition procedure.

The results are shown for α-synuclein spin systems.

(PDF)

S1 Table. Summary of classification performances.

Values of analyzed classification performance parameters are given for each method.

(PDF)

S2 Table. Training set sample conditions.

Sample conditions of the training data sets, as reported in BMRB entries.

(PDF)

Acknowledgments

The authors thank Dr. Thomas Schwarz and Prof. Robert Konrat from the University of Vienna, Max Perutz Laboratories for providing the α-synuclein sample.

Data Availability

All relevant data are deposited in Zenodo (10.5281/zenodo.7032142) and GitHub (https://github.com/gugumatz/LDA-for-mapping-IDPs).

Funding Statement

JAR, PP and KK thank National Science Centre of Poland (www.ncn.gov.pl) for its support in the form of an OPUS grant (2019/35/B/ST4/01506). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Bondos SE, Dunker AK, Uversky VN. On the roles of intrinsically disordered proteins and regions in cell communication and signaling; 2021.
  • 2. Nwanochie E, Uversky VN. Structure determination by single-particle cryo-electron microscopy: Only the sky (and intrinsic disorder) is the limit; 2019.
  • 3. Schwarzinger S, Kroon GJA, Foss TR, Chung J, Wright PE, Dyson HJ. Sequence-dependent correction of random coil NMR chemical shifts. Journal of the American Chemical Society. 2001;123. doi: 10.1021/ja003760i
  • 4. Wallin S. Intrinsically disordered proteins: structural and functional dynamics. Research and Reports in Biology. 2017;Volume 8. doi: 10.2147/RRB.S57282
  • 5. Grzesiek S, Bax A. Amino acid type determination in the sequential assignment procedure of uniformly 13C/15N-enriched proteins. Journal of Biomolecular NMR. 1993;3. doi: 10.1007/BF00178261
  • 6. Ikura M, Kay LE, Bax A. A Novel Approach for Sequential Assignment of 1H, 13C, and 15N Spectra of Larger Proteins: Heteronuclear Triple-Resonance Three-Dimensional NMR Spectroscopy. Application to Calmodulin. Biochemistry. 1990;29. doi: 10.1021/bi00471a022
  • 7. Kazimierczuk K, Stanek J, Zawadzka-Kazimierczuk A, Koźmiński W. High-dimensional NMR spectra for structural studies of biomolecules. ChemPhysChem. 2013;14(13):3015–3025. doi: 10.1002/cphc.201300277
  • 8. Grudziąż K, Zawadzka-Kazimierczuk A, Koźmiński W. High-dimensional NMR methods for intrinsically disordered proteins studies. Methods. 2018;148:81–87. doi: 10.1016/j.ymeth.2018.04.031
  • 9. Ulrich EL, Akutsu H, Doreleijers JF, Harano Y, Ioannidis YE, Lin J, et al. BioMagResBank. Nucleic Acids Research. 2008;36.
  • 10. Tamiola K, Acar B, Mulder FAA. Sequence-specific random coil chemical shifts of intrinsically disordered proteins. Journal of the American Chemical Society. 2010;132. doi: 10.1021/ja105656t
  • 11. Nielsen JT, Mulder FAA. POTENCI: prediction of temperature, neighbor and pH-corrected chemical shifts for intrinsically disordered proteins. Journal of Biomolecular NMR. 2018;70:141–165. doi: 10.1007/s10858-018-0166-5
  • 12. Bermel W, Bertini I, Chill J, Felli IC, Haba N, V VKM, et al. Exclusively Heteronuclear 13C-Detected Amino-Acid-Selective NMR Experiments for the Study of Intrinsically Disordered Proteins (IDPs). ChemBioChem. 2012;13. doi: 10.1002/cbic.201290067
  • 13. Zawadzka-Kazimierczuk A, Koźmiński W, Billeter M. TSAR: A program for automatic resonance assignment using 2D cross-sections of high dimensionality, high-resolution spectra. Journal of Biomolecular NMR. 2012;54(1):81–95. doi: 10.1007/s10858-012-9652-3
  • 14. Schmidt E, Güntert P. A New Algorithm for Reliable and General NMR Resonance Assignment. Journal of the American Chemical Society. 2012;134:12817–12829. doi: 10.1021/ja305091n
  • 15. Piai A, Gonnelli L, Felli IC, Pierattelli R, Kazimierczuk K, Grudziaz K, et al. Amino acid recognition for automatic resonance assignment of intrinsically disordered proteins. Journal of Biomolecular NMR. 2016;64. doi: 10.1007/s10858-016-0024-2
  • 16. Evangelidis T, Nerli S, Nováček J, Brereton AE, Karplus PA, Dotas RR, et al. Automated NMR resonance assignments and structure determination using a minimal set of 4D spectra. Nature Communications. 2018;9:384. doi: 10.1038/s41467-017-02592-z
  • 17. YongE F, GaoShan K. Identify Beta-Hairpin Motifs with Quadratic Discriminant Algorithm Based on the Chemical Shifts. PLOS ONE. 2015;10:e0139280. doi: 10.1371/journal.pone.0139280
  • 18. Balakrishnama S, Ganapathiraju A. Linear discriminant analysis-a brief tutorial. Institute for Signal and Information Processing. 1998;18(1998):1–8.
  • 19. Tharwat A, Gaber T, Ibrahim A, Hassanien AE. Linear discriminant analysis: A detailed tutorial. AI Communications. 2017;30(2):169–190. doi: 10.3233/AIC-170729
  • 20. Ghojogh B, Crowley M. Linear and quadratic discriminant analysis: Tutorial. arXiv preprint arXiv:1906.02590. 2019.
  • 21. Tharwat A. Linear vs. quadratic discriminant analysis classifier: a tutorial. International Journal of Applied Pattern Recognition. 2016;3(2):145–180. doi: 10.1504/IJAPR.2016.079050
  • 22. Wrasidlo W, Tsigelny IF, Price DL, Dutta G, Rockenstein E, Schwarz TC, et al. A de novo compound targeting α-synuclein improves deficits in models of Parkinson’s disease. Brain. 2016;139(12):3217–3236. doi: 10.1093/brain/aww238
  • 23. Kay LE, Ikura M, Tschudin R, Bax A. Three-dimensional triple-resonance NMR Spectroscopy of isotopically enriched proteins. Journal of Magnetic Resonance. 2011;213(2):423–441. doi: 10.1016/j.jmr.2011.09.004
  • 24. Yang D, Kay LE. TROSY triple-resonance four-dimensional NMR spectroscopy of a 46 ns tumbling protein. Journal of the American Chemical Society. 1999;121(11):2571–2575. doi: 10.1021/ja984056t
  • 25. Staykova DK, Fredriksson J, Bermel W, Billeter M. Assignment of protein NMR spectra based on projections, multi-way decomposition and a fast correlation approach. Journal of Biomolecular NMR. 2008;42(2):87–97. doi: 10.1007/s10858-008-9265-z
  • 26. Kazimierczuk K, Zawadzka-Kazimierczuk A, Koźmiński W. Non-uniform frequency domain for optimal exploitation of non-uniform sampling. Journal of Magnetic Resonance. 2010;205(2):286–292. doi: 10.1016/j.jmr.2010.05.012
  • 27. Kazimierczuk K, Zawadzka A, Koźmiński W, Zhukov I. Random sampling of evolution time space and Fourier transform processing. Journal of Biomolecular NMR. 2006;36(3):157–168. doi: 10.1007/s10858-006-9077-y
  • 28. Lee W, Tonelli M, Markley JL. NMRFAM-SPARKY: Enhanced software for biomolecular NMR spectroscopy. Bioinformatics. 2015. doi: 10.1093/bioinformatics/btu830
  • 29. https://bmrb.io/.
  • 30. https://github.com/gugumatz/LDA-for-mapping-IDPs.
  • 31. Zawadzka-Kazimierczuk A, Kazimierczuk K, Koźmiński W. A set of 4D NMR experiments of enhanced resolution for easy resonance assignment in proteins. Journal of Magnetic Resonance. 2010;202(1):109–116. doi: 10.1016/j.jmr.2009.10.006
  • 32. Holland DJ, Bostock MJ, Gladden LF, Nietlispach D. Fast multidimensional NMR spectroscopy using compressed sensing. Angewandte Chemie International Edition. 2011;50(29):6548–6551. doi: 10.1002/anie.201100440
  • 33. Kazimierczuk K, Orekhov VY. Accelerated NMR spectroscopy by using compressed sensing. Angewandte Chemie International Edition. 2011;50(24):5556–5559. doi: 10.1002/anie.201100370
  • 34. Mobli M, Stern AS, Bermel W, King GF, Hoch JC. A non-uniformly sampled 4D HCC(CO)NH-TOCSY experiment processed using maximum entropy for rapid protein sidechain assignment. Journal of Magnetic Resonance. 2010;204(1):160–4. doi: 10.1016/j.jmr.2010.02.012
  • 35. Coggins BE, Zhou P. High resolution 4-D spectroscopy with sparse concentric shell sampling and FFT-CLEAN. Journal of Biomolecular NMR. 2008;42(4):225–239. doi: 10.1007/s10858-008-9275-x
  • 36. Stanek J, Augustyniak R, Koźmiński W. Suppression of sampling artefacts in high-resolution four-dimensional NMR spectra using signal separation algorithm. Journal of Magnetic Resonance. 2012;214(1):91–102. doi: 10.1016/j.jmr.2011.10.009
  • 37. Hiller S, Fiorito F, Wüthrich K, Wider G. Automated projection spectroscopy (APSY). Proceedings of the National Academy of Sciences. 2005;102(31):10876–10881. doi: 10.1073/pnas.0504818102
  • 38. Jaravine VA, Zhuravleva AV, Permi P, Ibraghimov I, Orekhov VY. Hyperdimensional NMR spectroscopy with nonlinear sampling. Journal of the American Chemical Society. 2008;130(12):3927–3936. doi: 10.1021/ja077282o
  • 39. Hassanieh H, Mayzel M, Shi L, Katabi D, Orekhov VY. Fast multi-dimensional NMR acquisition and processing using the sparse FFT. Journal of Biomolecular NMR. 2015;63(1):9–19. doi: 10.1007/s10858-015-9952-5
  • 40. Hansen DF. Using Deep Neural Networks to Reconstruct Non-uniformly Sampled NMR Spectra. Journal of Biomolecular NMR. 2019;73(10-11):577–585. doi: 10.1007/s10858-019-00265-1
  • 41. Pustovalova Y, Mayzel M, Orekhov VY. XLSY: Extra-Large NMR Spectroscopy. Angewandte Chemie International Edition. 2018.
  • 42. Felli IC, Pierattelli R. 13C Direct Detected NMR for Challenging Systems. Chemical Reviews. 2022;122(10):9468–9496. doi: 10.1021/acs.chemrev.1c00871
PLoS Comput Biol. doi: 10.1371/journal.pcbi.1010258.r001

Decision Letter 0

Anna R Panchenko, Arne Elofsson

14 Jul 2022

Dear Prof. Kazimierczuk,

Thank you very much for submitting your manuscript "Linear discriminant analysis reveals hidden patterns in NMR chemical shifts of intrinsically disordered proteins" for consideration at PLOS Computational Biology.

As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by three independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments. Importantly, please make your software, user manual, training and test sets available and include a GitHub link in the abstract. 

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Anna R Panchenko

Associate Editor

PLOS Computational Biology

Arne Elofsson

Deputy Editor

PLOS Computational Biology

***********************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: The paper presents an interesting method for determining residue types in the NMR backbone assignment of intrinsically disordered proteins (IDPs), which generally suffer from severe resonance overlap. Linear discriminant analysis (LDA) is used to recognise amino acid residue types and shown to work (slightly) better than three other approaches.

The method is useful, and the paper is very well written and illustrated. The caveat is that the method works on ‘spin systems’ that have to be assembled beforehand from spectra/peak lists. This can be a limiting factor for IDPs, which generally feature extensive overlap that precludes the unambiguous formation of spin systems.

Provided that the software is made available (which seems not to be the case now), I propose publication after (very) minor revision.

Minor points:

- Author summary, p.1: NMR is one of the methods for IDP research, rather than “the method”.

- Methods (or Results): Describe the preparation of spin systems, in particular, how ambiguities have been handled/resolved.

- p. 5: The results of the comparison of four classification methods in the SI is hardly discussed in the main paper.

- p. 6: ‘(see Section )’ misses the section number.

- p. 7: ‘described in Section ‘ misses the section number.

- p. 8: phenyloalanine —> phenylalanine (2x)

- p. 9, Application 1: Again, describe how spin systems are formed.

- p. 9, caption of Fig. 8: From which protein(s) are the examples shown?

- p. 10: ‘16 resonances in the region of interest’: Is the situation similar for the other residues?

- Ref. 35 Wu K —> Wüthrich K

- Ref. 36 Orekhov VYVY —> Orekhov VY

- Fig. 6: What are the outliers at w(HB) = 0.6-0.8 ppm?

Reviewer #2: Romero et al. present a statistical approach to facilitate backbone assignments of IDPs. Whilst the work contains some interesting ideas, there are a number of issues and limitations.

1. The authors somehow failed to demonstrate that the approach has any advantages over the use of the 'traditional' chemical shift connectivity methods alone, which are very successful in assigning small and well-behaving IDPs such as 1 mM aSyn. The importance of the work would be significantly enhanced if the authors could demonstrate that the proposed algorithm improves accuracy and/or time-efficiency for more complex IDPs; for example, IDPs with limited/sparse chemical shift connectivities (e.g., due to low sample concentrations or poor data quality) and/or with significant peak overlap (e.g. 300+ a.a. IDPs).

2. Figure 5 clearly demonstrates that the accuracy of amino acid type predictions is <70% for 7 a.a. types out of 20, even if all seven chemical shifts were used for the analysis. The authors should provide specific examples to demonstrate how this (sometimes inaccurate) information could be incorporated into the assignment pipeline and how the use of this information would benefit the accuracy and/or efficiency of the assignment process.

3. The proposed method obviously relies on the availability of chemical shifts (particularly CB) for individual spin systems. However, for larger (100+ a.a.) IDPs the unambiguous assignment of 3D and 4D peaks to specific spin systems often becomes a major issue due to peak overlap in the NH 2D spectrum. The authors should discuss how this would affect the accuracy of the prediction.

4. In Application 2, the authors should show the accuracy of assignment transfer for all peaks in the spectra rather than for a small region, and compare it with the accuracy of assignment transfer obtained using only connectivity information from the 4D HabCab(CO)NH spectrum. Moreover, I expect that for aSyn, the assignments for the majority of resonances can be unambiguously transferred using connectivities obtained from a simpler (and more sensitive) 3D HNCA experiment, while CB chemical shifts would be critical for the LDA approach. The authors should provide example(s) that clearly demonstrate that the information about a.a. types (in addition to connectivity information obtained from the same experiment) would result in more accurate/efficient assignment transfer.

5. The authors should provide detailed information about the training protein set they used, including protein sizes, a.a. composition, experimental conditions (particularly, pH, temperature, referencing method).

Reviewer #3: --Romero et al. have outlined a new approach to assign spin systems to the amino-acid sequence of a protein, which is one of the most important steps in every bioNMR project. The approach, based on linear discriminant analysis (LDA), enables both resonance assignment as well as the transfer of assignments to spectra obtained under slightly different conditions where the peaks have moved from the reference assignment. I am impressed with the unique approach used by the authors, as I was previously unaware of LDA and had not seen it applied in NMR. That said, I am not surprised that these authors developed this new approach, given their excellent track record of applying NMR to obtain quantitative insight into intrinsically disordered proteins as well as developing advanced non-uniform sampling/spectral reconstruction methods and automated resonance assignment methods (e.g., TSAR).

--The new LDA approach performs very well on the tested benchmark comprising 18 proteins from the BMRB with known resonance assignments, with a mean weighted accuracy of 90%. I note that the accuracy per protein ranges from ~98% to ~78%, so there is room to explore why some proteins yield poorer results (see below). Moreover, the new LDA approach not only provides highly accurate resonance assignments based on a single 4D HabCab(CO)NH spectrum (recorded with NUS), but also shows the most probable assignment followed by the next most probable assignments. This is very useful for the user, as for spin systems with confidence values that are not close to 100%, the user can consider the top 2-3 most probable assignments to manually intervene.

--In general, I am strongly in favor of publishing this work in PLOS Computational Biology, given that the manuscript outlines an innovative approach, contains new methodology, and has the potential to advance the bioNMR studies of all IDPs (as well as folded proteins, given a re-training on folded protein spectra). Thus, I congratulate the authors on their new algorithm and the unique insight that it brings. However, before publication, I would suggest to the authors a few changes that would hopefully strengthen their work and demonstrate the rigor and utility of this new LDA approach. All of my suggested changes should be relatively easy to do, and so I would welcome a resubmission after the authors have addressed these minor concerns.

--I have outlined my suggestions below:

(1) Given that this new LDA approach seems to have most utility in the resonance assignment procedure, the authors should compare the performance of their LDA approach with some other automatic resonance assignment software (e.g., MARS, TSAR, etc.). An ideal test scenario would be a performance comparison on alpha-synuclein given the same input chemical shifts (e.g., from the single 4D HabCab(CO)NH). It would also be interesting to compare the performance of e.g. MARS or TSAR or AUTO-ASSIGN on the BMRB benchmark of 18 proteins with the results from the LDA approach.

--The main strength of the LDA approach, as I see it, is the LDA classification percentages (e.g., Figure 5 and Figure 7), which provides the user with more than one option in cases of low confidence. Thus, other software that only return one solution may yield inaccurate solutions whereas the LDA approach would very easily identify some ambiguity in the assignment solution space, which could most likely be easily resolved (as shown in Figure 8). Thus, it is important to show that the LDA approach provides new information relative to other state-of-the-art automatic assignment approaches.

(2) In Figure 8 (p. 9 of the text), the authors discuss how the LDA approach provides probability distributions in which one class (AA type) may not dominate (i.e., there is assignment ambiguity). The authors then state that the protein sequence can be used to rule out impossible solutions that cannot be consistent with the proposed assignments and the primary structure.

--I wonder if the authors could incorporate this manual step into the automatic analysis procedure? For example, if this manual calculation could be included as an automatic post-analysis step that follows the initial LDA classification, it would be helpful to the user, because I fear that uninformed users will simply take the highest assignment probability as the “answer” and not check for compatibility with the primary structure/chemical shifts.

--If this analysis of probability distributions could be coded as consistency check with the primary structure/chemical shifts, it would be a nice extension of the work to ensure that the assignment ambiguity does not lead to errors but rather is exploited by the program to circumvent errors. Of course, there will be situations where the probability distribution analysis followed by a post-processing step of consistency with primary structure (which the authors currently perform manually on a case-by-case basis) cannot return an unambiguous correct result. In these cases, the LDA approach shines because it has informed the user that multiple solutions fit the data equally well and there may be the need to collect additional NMR data or to manually inspect the data very closely. Having the two steps combined into one automatic approach (i.e., LDA analysis + inspection of consistency with primary structure) would further strengthen the confidence in the results, knowing that the consistency check may be able to prevent erroneous interpretation of the results by the users.
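The consistency check suggested in point (2) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' procedure: it assumes the spin systems have already been linked into a chain in sequence order, that each one carries a dictionary of residue-type probabilities, and that placements can be scored by the product of those probabilities. The function name, the `floor` parameter, the toy sequence, and all numbers are hypothetical.

```python
import math


def rank_sequence_positions(sequence, chain_posteriors, floor=1e-6):
    """Score every placement of a chain of linked spin systems on the sequence.

    sequence: one-letter protein sequence.
    chain_posteriors: list of dicts (one per spin system, in chain order)
        mapping one-letter residue type -> classifier probability.
    floor: small probability used for residue types absent from a dict,
        so that a single mismatch does not produce log(0).
    Returns (start_index, normalized_score) pairs, best placement first.
    """
    n = len(chain_posteriors)
    scores = []
    for start in range(len(sequence) - n + 1):
        log_score = 0.0
        for offset, posterior in enumerate(chain_posteriors):
            residue = sequence[start + offset]
            log_score += math.log(max(posterior.get(residue, 0.0), floor))
        scores.append((start, log_score))
    if not scores:
        return []
    # Normalize in a numerically stable way so the scores sum to 1.
    best = max(score for _, score in scores)
    weights = [(start, math.exp(score - best)) for start, score in scores]
    total = sum(weight for _, weight in weights)
    ranked = [(start, weight / total) for start, weight in weights]
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked


# Toy example: the first spin system is classified as Glu (0.55) over Gln (0.45),
# but only the Gln placement is compatible with its neighbours in the sequence.
toy_sequence = "MKQAGSTEVD"
toy_chain = [{"E": 0.55, "Q": 0.45}, {"A": 0.9, "S": 0.1}, {"G": 1.0}]
ranked = rank_sequence_positions(toy_sequence, toy_chain)
print(ranked[0][0])  # 2, i.e. the chain maps onto "QAG"
```

If two or more placements receive comparable normalized scores, the ambiguity remains and would be reported to the user rather than resolved silently.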

(3) The authors assume one normal distribution for the chemical shift of a given spin in a given residue. The authors further assume that the input chemical shifts in the benchmark would be representative of the underlying chemical shift distribution in the protein under investigation. However, the authors’ assumption of a single normal distribution may be problematic in the scenario in which the IDP under investigation has significant residual structure, meaning the distribution would have to cover ±4 ppm. Residual secondary structure occurs frequently in IDPs, and a quick browse of the BMRB benchmark used by the authors shows that BMRB ID 11526 has nearly significant alpha-helical secondary structure between residues ~60-78, reaching a maximum value of ~50% near residue 70 (https://link.springer.com/article/10.1007/s12104-013-9523-1). Another transient helix is found with secondary 13CA shifts of ~2 ppm in DARPP-32 (https://pubs.acs.org/doi/10.1021/bi801308y) from BMRB ID 15176 in the benchmark. Finally, in BMRB 15179 of the benchmark, there are at least 3 transient helices, the most C-terminal of which has secondary 13CA values near 4 ppm (from the same paper), suggesting near complete formation of the helix.

3a. Given these large secondary CA values (and presumably correspondingly large HA, HB, CB, CO secondary shifts going in the appropriate directions), could the authors comment on how they expect deviations from random coil character to affect the performance of the LDA classification?

3b. I wonder if there is a correlation between the errors produced by LDA analysis and the secondary CA or CB shifts? That is to say, have the authors noticed that the LDA classifier fails in situations where there is significant residual structure? This might be expected if the majority of residues in the benchmark are random coil whereas a small minority adopt transient secondary structure.

3c. Were the input chemical shifts corrected for possible referencing errors? I wonder if using, e.g. a sequence-based prediction of the expected/neighbor-corrected random coil shift (e.g., POTENCI) might yield a more accurate “center” of the normal distribution for the expected shift given the solution pH and temperature.
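For reference, the single-Gaussian assumption discussed in point (3) corresponds to the textbook LDA model written below. The notation is generic and assumed here for illustration (it is not taken from the manuscript): $\mathbf{x}$ is the vector of chemical shifts of a spin system, $k$ indexes residue types, $\pi_k$ is the class prior, $\boldsymbol{\mu}_k$ the class mean, and $\boldsymbol{\Sigma}$ the covariance shared by all classes.

```latex
p(\mathbf{x}\mid k) = \mathcal{N}(\mathbf{x};\boldsymbol{\mu}_k,\boldsymbol{\Sigma}),
\qquad
P(k\mid\mathbf{x}) =
  \frac{\pi_k\,\mathcal{N}(\mathbf{x};\boldsymbol{\mu}_k,\boldsymbol{\Sigma})}
       {\sum_{j}\pi_j\,\mathcal{N}(\mathbf{x};\boldsymbol{\mu}_j,\boldsymbol{\Sigma})},

\delta_k(\mathbf{x}) =
  \mathbf{x}^{\top}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_k
  - \tfrac{1}{2}\,\boldsymbol{\mu}_k^{\top}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_k
  + \log\pi_k ,

\bigl[\delta_k(\mathbf{x}+\boldsymbol{\Delta}) - \delta_j(\mathbf{x}+\boldsymbol{\Delta})\bigr]
 - \bigl[\delta_k(\mathbf{x}) - \delta_j(\mathbf{x})\bigr]
 = \boldsymbol{\Delta}^{\top}\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_k - \boldsymbol{\mu}_j).
```

The last line shows how a systematic offset $\boldsymbol{\Delta}$, such as a secondary shift from transient structure or a referencing error, would change the margin between two classes in this model: the effect depends on how the offset projects onto the direction separating the class means.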

(4) On p. 8 the authors note that residues such as Glycine and Proline (and others with missing chemical shifts) were not used in the LDA classification procedure for alpha-synuclein. However, I worry that the authors might be discarding otherwise potentially useful data – for example, Gly residues are often used manually to make the first assignments given that XG/GX motifs in the sequence may be readily identified. And Pro residues have relatively unique chemical shifts that render them readily identified in the LDA classification shown in Figure 5B and 5C. In these plots, Gly is identified with 100% accuracy and Pro with 97% accuracy – thus, these two residues are highly useful and can be identified by LDA with high accuracy.

4a. Could the authors somehow include Gly and Pro in the calculation rather than excluding them?

4b. Along the same lines as above, could the authors make use of non-Gly/Pro residues that have incomplete data? For example, in Figure 4 I note that the authors show that CB is by far the most important for classification followed by HB and CA. The remaining chemical shifts are less important (< 0.1 relative importance). Thus, for residues that have at least CB HB and CA shifts, I would assume that these can still be used as input?
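One way the question in 4b could be approached, sketched here only under the shared-covariance Gaussian model of standard LDA and not as a description of the authors' code, is to marginalize over the missing shifts. Writing $o$ for the index set of observed shifts (for example CA, CB, and HB), the marginal of a Gaussian is again Gaussian, so the classifier can be evaluated on the sub-vector alone. The symbols $\mathbf{x}_o$, $\boldsymbol{\mu}_{k,o}$, and $\boldsymbol{\Sigma}_{oo}$ are notation assumed for this illustration.

```latex
\mathbf{x}_o \mid k \;\sim\; \mathcal{N}\bigl(\boldsymbol{\mu}_{k,o},\,\boldsymbol{\Sigma}_{oo}\bigr),
\qquad
P(k\mid\mathbf{x}_o) =
  \frac{\pi_k\,\mathcal{N}(\mathbf{x}_o;\boldsymbol{\mu}_{k,o},\boldsymbol{\Sigma}_{oo})}
       {\sum_{j}\pi_j\,\mathcal{N}(\mathbf{x}_o;\boldsymbol{\mu}_{j,o},\boldsymbol{\Sigma}_{oo})},

\delta_k(\mathbf{x}_o) =
  \mathbf{x}_o^{\top}\boldsymbol{\Sigma}_{oo}^{-1}\boldsymbol{\mu}_{k,o}
  - \tfrac{1}{2}\,\boldsymbol{\mu}_{k,o}^{\top}\boldsymbol{\Sigma}_{oo}^{-1}\boldsymbol{\mu}_{k,o}
  + \log\pi_k .
```

Here $\boldsymbol{\mu}_{k,o}$ keeps only the observed entries of the class mean and $\boldsymbol{\Sigma}_{oo}$ the corresponding rows and columns of the shared covariance. Whether the reduced feature set retains enough discriminating power would still have to be tested on real data.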

(5) Finally, given that the most important shifts for the LDA classification are CB and HB, and thus side-chain shifts, I wonder if the authors considered using C(CO)NH or H(CCO)NH spectra? Was there a particular reason to choose the 4D HabCab(CO)NH spectrum over those mentioned above? I ask because the C(CO)NH and H(CCO)NH spectra would not only provide the highly diagnostic CB and HB shifts, but also even more diagnostic shifts for residues that extend beyond CB/HB. https://www.sciencedirect.com/science/article/abs/pii/S1064186683710198?via%3Dihub

For example, I note that Fig 5 shows that the LDA classification yields confidence values below 80% for Arg, Cys, Gln, Glu, Met, Phe, Trp, and Tyr residues. All of these except Cys would be very readily resolved with access to spins beyond HB/CB. I note that the authors have published an elegant 5D HC(CC-TOCSY)CONH that would provide access to some of these shifts.

https://www.sciencedirect.com/science/article/pii/S109078070900007X?via%3Dihub

If the problem lies with poor access to chemical shift statistics of IDPs for spins beyond HB/CB, then that is indeed a fundamental problem. But given that the LDA classifier performed so well with only 18 proteins as training input, I would hope that there are at least 18 IDPs that have complete 1H and 13C side-chain assignments to enable LDA classification with this expanded benchmark for side-chain chemical shifts.

----------------------------------------------------------

Minor comments

1. “alfa-synuclein” should be “alpha-synuclein” or better yet “α-synuclein”

2. p. 5 -- BMRB ID 6869 appears to be a folded protein? The web page with this ID is associated with the “solution structure of the C-Terminal 14 kDa Domain of the tau subunit from Escherichia coli DNA Polymerase III”

3. line 228, the “Section” number is missing

4. line 207, I would think that a citation should go here to e.g. some of the chemical-shift based work by the Bax group on TALOS or CS-Rosetta.

5. Why is the spread in Figure 4 between 98+% and 78+%? Is this due to an underlying AA compositional bias? Do all proteins have the same AA distribution? It would be interesting to see a correlation between the accuracy and the AA diversity for each of the proteins in the training set. Having access to such information would enable the user to predict a priori how well the LDA approach will work on their system of interest based on the underlying AA composition. It may be important to re-train on a larger database in the future if certain AAs are poorly represented (e.g., aromatics and hydrophobics).

6. For table 1, I guess the acq parameters for directly-detected 1H (t3 or t4) are the same and hence omitted? And I guess that ni = ni + ni2 (+ni3)?

7. In a typical manual analysis of NMR data, it is usually trivial to identify and assign the XG/GX, XA/AX, XS/SX, and XT/TX spin systems, as the authors even write in their introduction. Thus, would it be possible to enable the user to manually “fix” specific assignments before running the calculation? You might envision that including e.g. the first 25% of assignments (G, A, S, T-containing spin systems) that are usually trivial to obtain might improve the LDA classification by having this starting point. I don’t know if it’s possible to do this though.

8. I commend the authors for making their data available on Zenodo. I downloaded the HNCO and HabCab(CO)NH spectra and looked at them in Sparky. I was surprised that the HNCO contained so much noise; was it properly reconstructed? The HabCab(CO)NH spectrum on the other hand was very clean with minimal artifacts.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: No: The software is not available. A Github page is announced in the header section, but no details are given in the paper.

Reviewer #2: No: the authors should provide the code for the LDA analysis

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1010258.r003

Decision Letter 1

Anna R Panchenko, Arne Elofsson

20 Sep 2022

Dear Prof. Kazimierczuk,

We are pleased to inform you that your manuscript 'Linear discriminant analysis reveals hidden patterns in NMR chemical shifts of intrinsically disordered proteins' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to address minor comments from one of the reviewers and complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. 

Best regards,

Anna R Panchenko

Academic Editor

PLOS Computational Biology

Arne Elofsson

Section Editor

PLOS Computational Biology

***********************************************************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #2: Overall, the authors have addressed many of the points raised. However, they have failed to demonstrate the advantages of using their approach to assign IDPs. In theory, information about amino-acid types could help, but the manuscript would benefit significantly from a demonstration of such benefits for real systems. For example, the authors could incorporate information about residue types into AUTOASSIGN (or other NMR assignment software) and show the % of correctly assigned residues with and without residue type information obtained from their algorithm. From the examples in the manuscript, I have the impression that the data-quality requirements for LDA and for sequential methods are very similar, meaning that LDA would likely provide no benefit or extra information if the data quality is limited (i.e., for the majority of IDPs).

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #2: No

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1010258.r004

Acceptance letter

Anna R Panchenko, Arne Elofsson

3 Oct 2022

PCOMPBIOL-D-22-00827R1

Linear discriminant analysis reveals hidden patterns in NMR chemical shifts of intrinsically disordered proteins

Dear Dr Kazimierczuk,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Agnes Pap

PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Text. LDA vs. other classification methods.

    We analyzed the performance of LDA, Quadratic Discriminant Analysis (QDA), K-Nearest Neighbours (KNN) and Support Vector Machines (SVM) to pick the best classification method for our research. We looked at specific performance parameters such as accuracy, sensitivity, specificity and consistency.

    (PDF)

    S2 Text. LDA classification of proteins in the training set.

    The results of LDA for the proteins from the training data obtained in the same way as Fig 7 (subset (iii)).

    (PDF)

    S3 Text. Comparison of LDA performance with TSAR amino-acid recognition procedure.

    The performance of the LDA approach was compared with the amino-acid recognition procedure of the TSAR program [13].

    (PDF)

    S1 Fig. LDA classification of proteins in the training set.

    (PDF)

    S2 Fig. Sensitivity and specificity of the methods.

    Comparison of performance in classification for all 4 methods.

    (PDF)

    S3 Fig. LDA classification of proteins in the training set.

    The results for all 17 proteins from the BMRB that compose the training set are shown, demonstrating the efficiency and accuracy of the LDA approach.

    (PDF)

    S4 Fig. Comparison of LDA performance with TSAR amino-acid recognition procedure.

    The results are shown for α-synuclein spin systems.

    (PDF)

    S1 Table. Summary of classification performances.

    Values of analyzed classification performance parameters are given for each method.

    (PDF)

    S2 Table. Training set sample conditions.

    Sample conditions of the training data sets, as reported in BMRB entries.

    (PDF)

    Attachment

    Submitted filename: answer to reviewers.pdf

    Data Availability Statement

    All relevant data are deposited in Zenodo (10.5281/zenodo.7032142) and GitHub (https://github.com/gugumatz/LDA-for-mapping-IDPs).


    Articles from PLoS Computational Biology are provided here courtesy of PLOS
