A dataset of alternately located segments in protein crystal structures

Aviv A Rosenberg; Ailie Marx; Alexander M Bronstein

doi:10.1038/s41597-024-03595-4

2024 Jul 17;11:783. doi: 10.1038/s41597-024-03595-4

PMCID: PMC11255211 PMID: 39019896

Abstract

Protein Data Bank (PDB) files list the relative spatial location of atoms in a protein structure as the final output of the process of fitting and refining to experimentally determined electron density measurements. Where experimental evidence exists for multiple conformations, atoms are modelled in alternate locations. Programs reading PDB files commonly ignore these alternate conformations by default, leaving users oblivious to the presence of alternate conformations in the structures they analyze. This has led to underappreciation of their prevalence and under-characterisation of their features, and has limited the accessibility of this high-resolution data representing structural ensembles. We have trawled PDB files to extract structural features of residues with alternately located atoms. The output includes the distance between alternate conformations and identifies the location of these segments within the protein chain and in proximity of all other atoms within a defined radius. This dataset should be of use in efforts to predict multiple structures from a single sequence and should support studies investigating protein flexibility and its association with protein function.

Subject terms: X-ray crystallography, Data mining

Background & Summary

The diffraction pattern produced from the interaction of X-rays with protein molecules can be used to calculate an electron density map into which a model of the structure can be built and refined. Since the signal from a single molecule is far too weak for detection, proteins are first coerced, through crystallization, into an ordered, repeating lattice of molecules so that the diffraction from each molecule will constructively intensify the signal. Flexible and variable regions of the protein chain will lead to destructive interference, the absence of definable electron density and regions of the protein structure that cannot be modelled, at the extreme, or smeared electron density and a low model certainty, as represented by a high temperature factor otherwise known as the B-factor¹. Rigid regions of the protein, having little variability between molecules in the crystal lattice, are characterised by high model certainty and low B-factor. The molecules in the crystal can also adopt several well-defined conformations such that the electron density is best described by modelling atoms into two or more discrete locations, commonly termed alternate locations (altlocs).

Proteins are dynamic molecules, transitioning between ensembles of conformations (recently reviewed in²), and structural biology has recently been calling for methods that will push beyond the one sequence, one structure framework towards description and prediction of ensembles of conformations³. Although the conformational space explored in crystals is limited by the constraints of crystal packing, X-ray crystal structures modelled with alternate conformations are a largely overlooked resource for experimentally observed protein flexibility. Gutermuth et al. recently argued that this resource may have remained unnoticed because, to date, most modelling approaches either ignore altlocs altogether or resolve them with simple heuristics⁴. Indeed, whilst common structural visualization programs such as ChimeraX or PyMOL display alternate side chain locations, only a single, default backbone conformation is shown unless the user specifically calls up the hidden alternate conformation. In some cases, even tools dedicated to the question of protein flexibility appear to overlook alternate conformations. The PDBFlex database was developed to provide information on protein flexibility as revealed through comparison of different PDB structures for the same protein⁵. Looking into the database, last updated in 2020, we were unable to find reference to or use of alternate conformations in single crystal protein structures. More recently, a group used machine learning to predict same-sequence structure ensembles, again guided by cases of different PDB structures displaying conformational differences and originating from the same sequence, overlooking single crystal protein structures⁶. Altogether, it seems that single crystal alternate backbone conformations are underused in attempts to characterize protein ensembles, perhaps because the abundance of stable alternate conformations remains underrecognized.

Recent years have seen efforts to develop tools for uncovering unmodeled alternately located atoms⁷⁻¹⁰, particularly relevant for older PDB entries that predate refinement software capable of automatically detecting altlocs. However, there are few resources that collate data on altlocs from across the PDB. One that does is the Alternate Location Server, part of the OCA browser-database for protein structure/function, which maintains an up-to-date list of PDB structures having alternate locations¹¹. Although this list highlights the abundance of altlocs, including in backbone atoms, it provides very minimal data: a single line entry for each structure, noting the total number of residues modelled as altlocs and the minimum, maximum and average distance between any backbone altlocs. To provide a more detailed resource useful for surveying the alternate conformation landscape and analyzing their prevalence in greater detail, we have created a custom dataset of alternately modelled backbone segments. The dataset is available through Harvard Dataverse¹².

Methods

Raw data collection

Protein structures are collected from the Protein Data Bank (PDB) through a structured query against the polymer entity data API¹³,¹⁴. We queried for all entities in structures meeting the following criteria: (i) Method: X-Ray Diffraction; (ii) X-Ray Resolution ≤ 3.5 Å; (iii) R_free ≤ 0.33; (iv) Number of chains ≤ 20.

Each query result contains a list of PDB IDs with an entity number (e.g. 1ABC:1) matching the query criteria, and each entity within a PDB structure corresponds to one or more identical polypeptide chains which exist in the structure. Note that a structure may have more than one unique entity (e.g., 1ABC:1 and 1ABC:2), in which case we obtain both. For each unique entity ID, we then obtain its associated chains (e.g., 1ABC:A and 1ABC:C), and include all of them in the dataset. In cases where the chain IDs in the PDB files (author chains) do not match the canonical chain IDs assigned by the PDB, we map between the author and PDB chains, such that our dataset contains only the canonical IDs.

Altloc collection

We use BioPython¹⁵ to parse the PDB structure files and extract the residues and atom locations from the collected chains. For each atom in the structure, we parse all available alternate locations (altlocs) from the file. The altlocs are usually labelled with capital letters starting from ‘A’. In cases where a structure has a non-standard altloc labelling, we sort the labels lexicographically and relabel them starting from ‘A’. In such cases, the dataset will denote these altlocs as e.g. ‘A(Z)’ in the altloc name columns, meaning that the altloc with original label Z is denoted by A in the dataset’s other columns. This relabeling helps keep the column names consistent across different structures.
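
The relabelling rule described above can be illustrated with a minimal sketch. The function name `relabel_altlocs` is hypothetical and is not taken from the authors' code; the sketch also assumes that a label is shown with its original in parentheses only when relabelling actually changed it, which matches the `A;B(Z)` example in Table 2 but is otherwise an assumption.

```python
def relabel_altlocs(original_labels):
    """Map altloc labels to 'A', 'B', ... in lexicographic order.

    Returns a dict from each original label to the name used in the
    altloc name columns, e.g. 'A(Z)' when original label 'Z' becomes 'A'.
    """
    ordered = sorted(set(original_labels))
    mapping = {}
    for i, original in enumerate(ordered):
        new = chr(ord("A") + i)
        # Keep the original label in parentheses only when it changed
        # (an assumption made for this illustration).
        mapping[original] = new if new == original else f"{new}({original})"
    return mapping


print(relabel_altlocs(["Z"]))
print(relabel_altlocs(["Z", "A"]))
```

With a single non-standard label `Z`, the sketch yields `A(Z)`; with labels `A` and `Z`, it yields `A` and `B(Z)`.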

Aligning to Uniprot sequences

We align the amino acid sequence of each chain to the Uniprot¹⁶ record sequence to provide a Uniprot index for each residue in the chain. We query the PDB’s entry data API¹⁴ and examine the metadata to construct a mapping from the specific chain to a list of Uniprot IDs. Whilst most chains map to a single Uniprot ID, there are cases of synthetic proteins which have no associated Uniprot ID, and other cases where a chain is chimeric, i.e. contains sections from multiple different proteins. We discard such cases and keep only chains which map to a unique Uniprot ID. The alignment is performed using BioPython’s default pairwise alignment algorithm. We used BLOSUM80 as the substitution matrix for the alignment, with a gap-opening penalty of −10 and a gap-extension penalty of −0.5.

Backbone locations and dihedral angles per altloc

For each PDB chain, we calculate the backbone angles $(φ, ψ)$ per altloc. To calculate a dihedral angle at altloc X, we take the X-altloc coordinate of all atoms participating in the calculation (from the current and previous/next residue). In case the atoms required for dihedral angle calculation from either the previous or next residue do not have the current altloc, we use the single set of coordinates modelled at that location in the calculation of the dihedral angles of all the current altlocs. The dataset also always includes the dihedral angles calculated with all atoms at their default positions, i.e. ignoring altlocs. For each of the backbone atoms in each residue, we also collect its XYZ coordinates under each of the altlocs which exist for it. Finally, we use DSSP to assign a secondary structure per residue.
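
The per-altloc angle calculation can be sketched as follows. This is an illustration using the standard four-point dihedral formula, not the authors' implementation; `coord_for_altloc` and `dihedral_deg` are hypothetical names, and returning `None` when a neighbouring atom has several altlocs but not the requested one is an assumption, since the text does not describe that case.

```python
import math


def coord_for_altloc(atom_altlocs, altloc):
    """Pick the coordinate of an atom for a given altloc.

    atom_altlocs maps altloc label -> (x, y, z). If the atom lacks the
    requested altloc but has a single modelled location, that location
    is used. Otherwise None is returned (assumption of this sketch).
    """
    if altloc in atom_altlocs:
        return atom_altlocs[altloc]
    if len(atom_altlocs) == 1:
        return next(iter(atom_altlocs.values()))
    return None


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dihedral_deg(p0, p1, p2, p3):
    """Signed dihedral angle in degrees defined by four points."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))
```

For $φ$ at altloc X, the four points would be the previous residue's C followed by the current residue's N, CA and C, each selected with `coord_for_altloc(..., "X")`; for $ψ$, the current N, CA and C followed by the next residue's N.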

B-factors, location standard deviations and distances between altlocs

We calculate the b-factor per residue, by averaging the b-factors of the N, CA and C backbone atoms. This is performed using the default atom positions, and additionally for each altloc which is defined for all three atoms.

For the CA atom, we also calculate, per altloc, the standard deviation in its location and distance from other altlocs, in Angstroms. The standard deviation is obtained from the b-factor $B_{X}$ of altloc X by $σ_{X} = \sqrt{B_{X} / (8 π^{2})}$. For each pair of altlocs X and Y, we then calculate the distance in Angstroms between the alpha-carbons, $d_{X, Y} = ‖ p_{CA, X} - p_{CA, Y} ‖$, where $p_{CA, X}$ is the location of the alpha-carbon under altloc X. We also calculate this distance in units of the standard deviation, which is given by ${\tilde{d}}_{X, Y} = ‖ p_{CA, X} - p_{CA, Y} ‖ / \sqrt{σ_{X} σ_{Y}}$.
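
As a worked illustration of these formulas, the following sketch computes the standard deviation from a b-factor and the raw and normalized CA distances. The function names are hypothetical and the inputs in the usage lines are made-up values, not records from the dataset.

```python
import math


def sigma_from_bfactor(b):
    """Positional standard deviation (Angstroms) from an isotropic b-factor."""
    return math.sqrt(b / (8 * math.pi ** 2))


def ca_distance(p_x, p_y):
    """Euclidean distance between the CA positions of two altlocs."""
    return math.dist(p_x, p_y)


def ca_distance_norm(p_x, p_y, b_x, b_y):
    """CA distance divided by sqrt(sigma_X * sigma_Y)."""
    s_x, s_y = sigma_from_bfactor(b_x), sigma_from_bfactor(b_y)
    return ca_distance(p_x, p_y) / math.sqrt(s_x * s_y)


# Illustrative values only.
print(round(sigma_from_bfactor(20.0), 3))
print(round(ca_distance_norm((0.0, 0.0, 0.0), (0.6, 0.8, 0.0), 20.0, 25.0), 3))
```

Note that the normalization uses the geometric mean of the two standard deviations, so two altlocs with very different b-factors are still compared on a single common length scale.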

Finally, we calculate the peptide bond length between adjacent residues under each pair of altlocs of the current residue’s carbon and the next residue’s nitrogen.

Contacts

We calculate the contacts between all atoms of a residue, under all its altlocs, and all other atoms in the PDB structure, also under all possible altlocs. The per-atom contacts are then aggregated to the residue level for inclusion in the dataset.

First, we collect the set of locations of all atoms in the structure, under all altlocs. Next, we iterate over each residue in the chain, each atom within it, and each altloc defined for that atom. Given the location of this altloc atom as a source, we calculate the distance to each target atom in the set of all locations. Two atoms are defined as in contact when their distance is below a threshold of 5 Angstroms. Each detected contact is then classified into one of three types: regular AA contact, out-of-chain (OOC) contact or ligand contact, depending on the identity and chain of the target atom. Contacts from all atoms of the current residue are collected into one of three lists of contacts for that residue, based on this classification. The minimum distance is calculated across atoms belonging to the same source and target residue. Hydrogen atoms and water molecules are always excluded even if they are modeled in the structure.
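
The contact procedure can be sketched with a brute-force loop. This is a simplified illustration rather than the authors' implementation: the `AtomLoc` record, the function names and the `~` marker for single-location atoms (borrowed from the contact format in Table 1) are assumptions of the sketch, hydrogens and waters are assumed to have been filtered out by the caller, and keeping one minimum distance per target residue and target altloc is also an assumption.

```python
import math
from dataclasses import dataclass

CONTACT_RADIUS = 5.0  # Angstroms, the threshold stated in the text


@dataclass(frozen=True)
class AtomLoc:
    chain: str
    res_index: int
    res_name: str
    altloc: str      # '~' when the atom has a single modelled location
    is_ligand: bool
    xyz: tuple


def classify(source, target):
    """Classify a contact as 'ligand', 'ooc' (out-of-chain) or 'aa'."""
    if target.is_ligand:
        return "ligand"
    if target.chain != source.chain:
        return "ooc"
    return "aa"


def residue_contacts(source_atoms, all_atoms, radius=CONTACT_RADIUS):
    """Aggregate contacts of one residue (under one altloc) to residue level.

    Returns {type: {(chain, res_name, res_index, altloc): min_distance}}.
    """
    contacts = {"aa": {}, "ooc": {}, "ligand": {}}
    own = {(a.chain, a.res_index) for a in source_atoms}
    for s in source_atoms:
        for t in all_atoms:
            if (t.chain, t.res_index) in own:
                continue  # skip atoms of the source residue itself
            d = math.dist(s.xyz, t.xyz)
            if d >= radius:
                continue
            key = (t.chain, t.res_name, t.res_index, t.altloc)
            bucket = contacts[classify(s, t)]
            # Keep the minimum distance over all atom pairs of the two residues.
            bucket[key] = min(d, bucket.get(key, d))
    return contacts
```

A spatial index such as a k-d tree would avoid the quadratic scan over all atom pairs; the plain double loop is used here only to keep the sketch dependency-free.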

Codon assignment

Since the exact genetic sequence of the protein is not annotated in the PDB, we assigned codons from the native sequence following the procedure described in our prior work¹⁷. Given a PDB chain, we obtain its unique Uniprot ID from the alignment step described above. We query Uniprot to obtain all cross-referenced IDs to the European Nucleotide Archive (ENA). From the ENA database, we obtain all available genetic sequences for the specific protein, translate each genetic sequence to an amino-acid sequence using the standard genetic code table, and perform pairwise sequence alignment between the PDB chain’s amino-acid sequence and the translated genetic sequences. The alignment is performed with BioPython, using the same options as for the Uniprot alignment.

Following the pairwise alignment of the amino acid sequence to all translated genetic sequences, we obtain the aligned codons from each sequence and assign them to corresponding residues from the PDB chain. This process yields zero or more assigned codons per residue in the PDB chain. In cases where there is more than one codon (i.e., different genetic sequences contributed different codons), we choose the most common, and reflect this ambiguity by assigning a codon score which is the proportion of genetic sequences that contributed the assigned codon.
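
The choice of codon and its score can be illustrated with a short sketch. The function name `assign_codon` is hypothetical; the ordering of the options string and the tie-breaking rule (first codon encountered among equally common ones) are assumptions, since the text does not specify them.

```python
from collections import Counter


def assign_codon(candidates):
    """Pick the most common codon among those contributed by aligned sequences.

    candidates: list of codons, one per genetic sequence that aligned to
    this residue. Returns (codon, score, options) where score is the
    proportion of sequences contributing the chosen codon.
    """
    if not candidates:
        return None, None, ""
    counts = Counter(candidates)
    # Ties resolve to the first codon encountered (assumption of this sketch).
    codon, n = counts.most_common(1)[0]
    score = n / len(candidates)
    options = "/".join(sorted(counts))  # sorted order is an assumption
    return codon, score, options


print(assign_codon(["CCA", "CCA", "CCA", "CCA", "CCG"]))
```

In this example four of five sequences contribute CCA, so the assigned codon is CCA with a score of 0.8, and a score of 1.0 would indicate an unambiguous assignment.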

Removal of low-quality structures

We used the R-factor to remove structures with a potentially poor fit to the electron density. A structure was defined as admissible only if it met both of the following criteria: (1) R_work ≤ 0.98 R_free; and (2) R_free ≤ min{0.3, max{0.2, resolution-dependent cut-off}}. The resolution-dependent cut-off was fitted as a monotone polynomial to the 90th percentile of R_free estimated in 12 equiprobable resolution bins ranging from 0.5 to 3.5 Å (Fig. 2).
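
The two admissibility criteria can be written as a small predicate. This is a sketch only: the function name is hypothetical, and because the fitted polynomial is not given in the text, the resolution-dependent cut-off is passed in as an already evaluated number.

```python
def is_admissible(r_work, r_free, resolution_cutoff):
    """Apply both R-factor criteria to one structure.

    resolution_cutoff is the value of the fitted resolution-dependent
    cut-off at this structure's resolution (supplied by the caller).
    """
    gap_ok = r_work <= 0.98 * r_free
    r_free_ok = r_free <= min(0.3, max(0.2, resolution_cutoff))
    return gap_ok and r_free_ok


# Illustrative values only.
print(is_admissible(0.18, 0.22, 0.25))
print(is_admissible(0.22, 0.22, 0.25))
```

The clamping means the effective R_free threshold never drops below 0.2 and never exceeds 0.3, whatever the fitted curve returns.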

Fig. 1 — Example of broken altloc chain shown on the crystal structure 1VYO (Avidin at 1.48Å resolution). Altloc B (red) is not modelled between residues 37 and 41 and is therefore counted as two separate segments in our data set.

Non-redundant cluster assignment

To account for redundancy among the collected chains and the proteins they originate from, we clustered the chains into non-redundant clusters using the amino acid sequence data. Clustering was performed using MMseqs2¹⁸ with a minimum sequence identity threshold of 0.5 and a target coverage of 0.8. Cluster identities were recorded in the metadata alongside the chain identities.

Segmentation of contiguous altlocs

For each of the collected chains containing altlocs, we grouped all altlocs with contiguous residue numbers into consecutively numbered segments. A segment requires altlocs to be present at every residue position within it. This means that when a stretch of altlocs is interrupted by residues lacking the altloc, it is counted as two segments (Fig. 1).
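
The segmentation rule amounts to grouping consecutive residue numbers, as in the following minimal sketch. The function name is hypothetical, and the residue numbers in the usage line are invented for illustration.

```python
def altloc_segments(residue_indices):
    """Group residue indices that harbour altlocs into contiguous segments."""
    segments = []
    for idx in sorted(set(residue_indices)):
        if segments and idx == segments[-1][-1] + 1:
            segments[-1].append(idx)   # extends the current segment
        else:
            segments.append([idx])     # a gap starts a new segment
    return segments


# A gap between residues 36 and 42 splits the altlocs into two segments.
print(altloc_segments([33, 34, 35, 36, 42, 43]))
```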

Data Records

The dataset is available through Harvard Dataverse¹². The dataset contains amino acid-level data records and chain-level metadata records.

Data

A single comma-separated value (csv) file containing local altloc description for the entire dataset can be accessed as altloc_data.csv.

Each dataset entry (row) represents a single residue in a specific chain of a PDB structure. The columns describing the residue are explained in Table 1.

Table 1.

Description of the columns in the data records.

Column Name	Description	Example
pdb_id	PDB and chain identifier, delimited by a colon.	2WUR:A
unp_id	Uniprot protein identifier.	P42212
pdb_idx	Zero-based index of the residue in the PDB chain sequence.	13
unp_idx	Zero-based index of the corresponding residue in the Uniprot sequence.	12
seg_id	Contiguous altloc segment number within the chain.	1
res_name	Residue amino acid name, using the standard single-letter designation.	P
res_icode	Residue insertion code. Sometimes used by PDB authors to represent an experimental addition to the protein sequence. Value of ‘M’ denotes an unmodelled residue.	M
res_hflag	Residue hetero flag. Describes any molecules, besides amino acids and water, which are part of the chain.	H_GYS
rel_loc	Relative location of the residue from the start of the sequence, as a value in the interval (0, 1].	0.24
codon	Codon assigned to this residue by aligning to genetic sequence from ENA.	CCA
codon_score	The proportion of ENA records which contained the assigned codon at the current position. A value of 1.0 means the assignment was unique.	0.8
codon_opts	Names of all codons that were found in ENA records aligned to the current position, delimited by ‘/’.	GGA/GGT
secondary	DSSP-assigned secondary structure label.	E
phi_X	The standard $φ$ dihedral angle at this residue, calculated by using the altloc X of all atoms involved (where available). Given in degrees within the range (−180, 180].	−123.456
psi_X	Same as above, for the standard $ψ$ dihedral angle.	123.456
omega_X	Same as above, for the standard $ω$ dihedral angle.	−179.876
bfactor_X	Average b-factor computed from the b-factors of the X altloc in the current residue’s backbone atoms (N, CA, C)	6.543
contact_count_X	Total number of X-contacts (as defined in the preceding sections) between the current residue’s atoms and atoms of other residues in the structure.	42
contact_types_X	Type of contacts this residue’s atoms make. Currently not implemented. Always ‘proximal’.	proximal
contact_smax_X	Maximum distance, in the amino acid sequence, of any of the current residue’s X-contacts.	24
contact_ooc_X	Comma-delimited list of X-contacts between the current residue’s atoms and atoms of a residue in another chain. Contacts specified as: ‘chain_id:residue_name:residue_index:target_altloc:distance’. Target altloc appears as ~ if target has only one modeled location.	A:T:9:B:3.97, C:M:88:A:4.03
contact_non_aa_X	Comma-delimited list of X-contacts between the current residue’s atoms and atoms of ligand in the structure. Contact is specified as above, with ligand instead of residue name.	A:GYS:66:~:4.95, A:EOH:242:B:3.98
contact_aas_X	Comma-delimited list of X-contacts between the current residue’s atoms and atoms of a residue in the same chain. Contacts specified as above.	A:T:9:B:3.97, C:M:88:A:4.03
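
The contact columns share one string format, which can be parsed as sketched below. The `Contact` type and `parse_contacts` function are hypothetical helpers, not part of the dataset's tooling; the sketch assumes that residue indices are plain integers and maps the `~` marker to `None`.

```python
from typing import NamedTuple, Optional


class Contact(NamedTuple):
    chain_id: str
    name: str               # residue or ligand name
    index: int
    altloc: Optional[str]   # None when the target has one modelled location
    distance: float         # Angstroms


def parse_contacts(field):
    """Parse a comma-delimited contact field into Contact records."""
    contacts = []
    for item in field.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate empty fields and trailing commas
        chain_id, name, index, altloc, distance = item.split(":")
        contacts.append(
            Contact(chain_id, name, int(index),
                    None if altloc == "~" else altloc, float(distance))
        )
    return contacts


for c in parse_contacts("A:GYS:66:~:4.95, A:EOH:242:B:3.98"):
    print(c)
```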

Where altlocs exist for a particular residue, additional columns are populated within the same row and include information pertaining to each altloc as described in Table 2. Italic characters, X and Y, in the column names shown in Table 2 are placeholders for actual altloc label names. We limit the altlocs to four (A, B, C, D).

Table 2.

Description of the columns describing altlocs in the data records.

Column Name	Description	Example
num_altlocs	Number of alternate location labels across all the backbone atoms of the residue.	3
altlocs_N	Semicolon-delimited names of altlocs which exist for the nitrogen atom. May contain optional original altloc label in parenthesis.	A;B(Z)
altlocs_CA	Semicolon-delimited names of altlocs which exist for the alpha-carbon atom. Format as above.	A;B(Z)
altlocs_C	Semicolon-delimited names of altlocs which exist for the carbon atom. Format as above.	A;B(Z)
dist_CA_XY	Distance in Angstroms between the alpha-carbon atom positions of this residue, for altloc X and altloc Y.	0.089
dist_CA_XY_norm	Same as above, where the distance is normalized to units of the standard deviation, based on the b-factors of both altlocs.	0.36
sigma_CA_X	Standard deviation of atom location, in Angstroms, for altloc X. Calculated from the isotropic b-factor as $σ = \sqrt{B / (8 π^{2})}$.	0.235
n_terminal_dist	Distance in number of residues from the N-terminal. Note: if terminal regions contain unmodelled residues, the distance is under-estimated.	10
c_terminal_dist	Distance in number of residues from the C-terminal. Note: if terminal regions contain unmodelled residues, the distance is under-estimated.	10

Metadata

A single comma-separated value (csv) file containing PDB structure- and chain-level metadata can be accessed as altloc_metadata.csv.

The dataset entries (rows) each represent a single chain. Columns are described in Table 3.

Table 3.

Description of the columns in the metadata records.

Column Name	Description
pdb_id	PDB and chain identifier, delimited by a colon.
unp_id	Uniprot protein identifier.
ena_id	Identifier of ENA genetic sequence used for codon assignment.
seq_len	Number of residues in PDB chain
num_altlocs	Number of residues which harbor altlocs in the PDB chain.
title	Title (as deposited in the PDB) of the structure containing this chain.
description	Description (as deposited in the PDB) of the structure containing this chain.
entity_description	Description (as deposited in the PDB) of the polymer entity associated with this chain.
deposition_date	Date the structure was deposited to the PDB.
entity_source_org	Name of the protein’s source organism.
entity_source_org_id	NCBI taxonomy ID of the source organism.
entity_host_org	Name of the host organism, i.e. the organism in which the protein was experimentally expressed.
entity_host_org_id	NCBI taxonomy ID of the host organism.
resolution	X-ray crystallography high resolution limit of data collection.
resolution_low	X-ray crystallography low resolution limit of data collection.
r_free	Structure refinement R_free value.
r_work	Structure refinement R_work value.
space_group	Symbol of space-group describing the crystal symmetries.
cg_ph	The pH at which the crystal was grown.
cg_temp	The temperature in kelvins at which the crystal was grown.
chain_ligands	List of the chemical component identifiers for all ligand interactions in the chain.
ligands	List of the chemical component identifiers for all ligand interactions in the structure.
entity_chains	PDB chain identifiers of all chains in the structure which belong to the same polymer entity as the current chain.
entity_auth_chains	As above but using the PDB deposition author’s original chain identifiers instead of the PDB-assigned identifiers.
chain_entities	Identifier (internal to the structure) of the polymer entity associated with this chain.
chain_to_auth_chain	Deposition author-assigned name of this chain.
entity_sequence	Canonical sequence of the protein in the standard one-letter code of amino acids.
num_altloc_segments	The number of segments of contiguous residues harboring altlocs within the chain.
cluster_id	The MMseqs2 cluster identity of the current chain sequence.

Technical Validation

Since our data is derived from the PDB¹³, our starting point is records that have already been validated through the standardised procedures of this well-established global repository. The quality and reliability of crystallographic structures is commonly assessed by the R-factor, a measure of the goodness of fit between the refined model and the experimental X-ray diffraction data. A small percentage of the data is left out of refinement, and an analogous R_free is calculated on it and compared against the R-factor to detect potential overfitting. Figure 2 shows the resolution-dependent R_free thresholds we used to exclude models with a poor fit to the experimental data.

Fig. 2 — Quantiles of the R_free parameter across the collected structures as a function of resolution. The cut-off above which low-quality structures were rejected is plotted in thick black.

As a validation step, we verified that our collected data follows expected trends in altloc abundance. Figure 3 shows the distribution of resolutions for structures deposited in the PDB during different time brackets, showing that structures with resolutions between 1.5 Å and 2.5 Å constituted a large and similar fraction of the structures deposited in each period since 1995. Focusing on this resolution range, Fig. 4 shows how the fraction of structures modelled with an altloc, particularly short segments of 1–2 amino acids, progressively increased until 2010. This rise would be expected as modelling and refinement techniques improved and became more automated¹⁹,²⁰. Another clear and expected trend is the increase in altlocs with improving resolution, as shown in Fig. 5.

Fig. 3 — Absolute and relative histograms of the resolution of the retained structures by structure deposition date.

Fig. 4 — Absolute and relative histograms of backbone altloc segment lengths in the resolution range 1.5–2.5Å by structure deposition date. *Any* refers to any of the collected chains regardless of altloc presence; *Altloc* refers to the presence of any altloc; Backbone ≥ n refers to the presence of a segment of length ≥ n containing alternate locations for CA atoms. We observe a steady and sharp increase in the fraction of structures with modelled altlocs in years preceding 2010 which we attribute to the improvement of experimental techniques and data analysis software. Note that the fraction of structures with long altloc segments (≥3) increases only slightly, probably since such segments are more readily modelled with older protocols.

Fig. 5 — Absolute and relative histograms of backbone altloc segment lengths by resolution. *Any* refers to any of the collected chains regardless of altloc presence; *Altloc* refers to the presence of any altloc; Backbone ≥ n refers to the presence of a segment of length ≥ n containing alternate locations for CA atoms. Shown are distributions of individual chains (first row), and non-redundant clusters (second row). Note that the fraction of modelled altloc segments (especially the long ones) consistently increases with better resolution. We attribute this trend to the finer ability to discern between truly multi-modal distribution of the electron density (altlocs), and the uni-modal high B-factor case. About 4% of segments of length 2 are peptide flips.

Manual verification of the correspondence between collected contact distances and those observed in visualisation of the protein structure was carried out as shown in Fig. 6. We note here that our current dataset analyses the contents of the asymmetric unit, as these coordinates are explicitly available in the PDB file. A current limitation of this dataset is that it does not identify contacts made with neighbouring molecules in the crystal lattice.

Fig. 6 — Categorization of contact types shown on the crystal structure 6MXX (TP53-binding protein 1 at 2.3 Å resolution). Two well-separated 7 amino acid-long altloc segments in chain J (residues 1493–1499, displayed in red and blue) are in proximity of Y1523 in chain I (out-of-chain contact displayed in white), and a K6P ligand molecule (ligand contact displayed in violet). The records corresponding to the highlighted contacts are contact_ooc_A[J:S:1497] = "I:Y:1523:~:3.05,I:E:1524:~:4.32" and contact_non_aa_B[J:W:1495] = "J:K6P:1701:A:2.54,J:K6P:1701:B:4.13". Note that hydrogen atoms (displayed with transparency) are excluded from distance calculations.

Author contributions

A.A.R., A.M. and A.M.B. conceived the idea, defined the data collection pipeline, verified the data collection processes and wrote the manuscript. A.A.R. and A.M.B. performed the data collection.

Code availability

The code implementing the described data collection and analysis methods is accessible at https://github.com/vistalab-technion/pp5.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Ailie Marx, Email: ailiem@migal.org.il.

Alexander M. Bronstein, Email: alexbronst@gmail.com

References

1. Sun Z, Liu Q, Qu G, Feng Y, Reetz MT. Utility of B-Factors in Protein Science: Interpreting Rigidity, Flexibility, and Internal Motion and Engineering Thermostability. Chemical reviews. 2019;119(3):1626–1665. doi: 10.1021/acs.chemrev.8b00290.
2. Nussinov R, Liu Y, Zhang W, Jang H. Protein conformational ensembles in function: roles and mechanisms. RSC chemical biology. 2023;4(11):850–864. doi: 10.1039/D3CB00114H.
3. Lane TJ. Protein structure prediction has reached the single-structure frontier. Nat Methods. 2023;20:170–173. doi: 10.1038/s41592-022-01760-4.
4. Gutermuth T, Sieg J, Stohn T, Rarey M. Modeling with Alternate Locations in X-ray Protein Structures. Journal of chemical information and modeling. 2023;63(8):2573–2585. doi: 10.1021/acs.jcim.3c00100.
5. Hrabe T, et al. PDBFlex: exploring flexibility in protein structures. Nucleic acids research. 2016;44(D1):D423–D428. doi: 10.1093/nar/gkv1316.
6. Audagnotto M, et al. Machine learning/molecular dynamic protein structure prediction approach to investigate the protein conformational ensemble. Sci Rep. 2022;12:10018. doi: 10.1038/s41598-022-13714-z.
7. Keedy DA, et al. Mapping the conformational landscape of a dynamic enzyme by multitemperature and XFEL crystallography. Elife. 2015;30:4. doi: 10.7554/eLife.07574.
8. Riley BT, et al. qFit 3: Protein and ligand multiconformer modeling for X-ray crystallographic and single-particle cryo-EM density maps. Protein science. 2021;30(1):270–285. doi: 10.1002/pro.4001.
9. Stachowski TR, Fischer M. FLEXR: automated multi-conformer model building using electron-density map sampling. Acta crystallographica. Section D, Structural biology. 2023;79(Pt 5):354–367. doi: 10.1107/S2059798323002498.
10. Wankowicz SA, et al. Uncovering Protein Ensembles: Automated Multiconformer Model Building for X-ray Crystallography and Cryo-EM. Elife. 2023;12:RP90606. doi: 10.7554/eLife.90606.3.
11. Prilusky J. OCA, a browser-database for protein structure/function. http://oca.weizmann.ac.il and mirrors worldwide. (1996)
12. Rosenberg A, Marx A, Bronstein A. 2024. A catalogue of alternately located segments in protein crystal structures. Harvard Dataverse V1.
13. Berman HM, et al. The Protein Data Bank. Nucleic acids research. 2000;28(1):235–242. doi: 10.1093/nar/28.1.235.
14. Rose Y, et al. RCSB Protein Data Bank: Architectural Advances Towards Integrated Searching and Efficient Access to Macromolecular Structure Data from the PDB Archive. Journal of Molecular Biology (2020)
15. Cock PJ, et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics. 2009;25(11):1422–1423. doi: 10.1093/bioinformatics/btp163.
16. UniProt Consortium. UniProt: the Universal Protein Knowledgebase in 2023. Nucleic Acids Res. 2023;51(D1):D523–D531. doi: 10.1093/nar/gkac1052.
17. Rosenberg AA, Marx A, Bronstein AM. Codon-specific Ramachandran plots show amino acid backbone conformation depends on identity of the translated codon. Nat Commun. 2022;13(1):2815. doi: 10.1038/s41467-022-30390-9.
18. Steinegger M, Söding J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat Biotechnol. 2017;35:1026–1028. doi: 10.1038/nbt.3988.
19. Adams PD, et al. PHENIX: a comprehensive Python-based system for macromolecular structure solution. Acta Cryst. 2010;D66:213–221. doi: 10.1107/S0907444909052925.
20. Winn MD, et al. Overview of the CCP4 suite and current developments. Acta Cryst. 2011;D67:235–242. doi: 10.1107/S0907444910045749.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Data Citations

Rosenberg A, Marx A, Bronstein AA. 2024. catalogue of alternately located segments in protein crystal structures. Harvard Dataverse V1. [DOI] [PMC free article] [PubMed]

Data Availability Statement

The code implementing the described data collection and analysis methods is accessible at https://github.com/vistalab-technion/pp5.
