Abstract
Root mean square displacement (RMSD) calculations play a fundamental role in the comparison of different conformers of the same ligand. This is particularly important in the evaluation of protein-ligand docking, where different ligand poses are generated by docking software and their quality is usually assessed by RMSD calculations. Unfortunately, many RMSD calculation tools do not take into account the symmetry of the molecule, remain difficult to integrate flawlessly in cheminformatics and machine learning pipelines—which are often written in Python—or are shipped within large code bases. Here we present a new open-source RMSD calculation tool written in Python, designed to be extremely lightweight and easy to integrate into existing software.
Keywords: RMSD, Symmetry, Software, Python
Introduction
Computational structure-based drug discovery has steadily gained traction partially thanks to the constant improvements in available software, now often free and open source. Protein-ligand docking in particular is now a standard tool employed in the early stages of drug discovery pipelines in order to screen possible drugs acting on a known target of interest.
Protein-ligand docking consists of the prediction of binding modes and binding affinity of a (flexible) ligand to a target of known structure. The performance of docking programs is often assessed by their ability to reproduce the crystallographic pose of the bound ligand. A common metric to evaluate the difference between the predicted binding pose and the crystallographic pose is the heavy-atoms root mean square displacement (RMSD) [1], although other metrics have been suggested [2]. RMSD calculations are also used in other contexts, for example for the evaluation of diversity in generated conformers [3].
Many simple scripts to compute RMSDs are based on the assumption of a direct one-to-one mapping between atoms of different conformers of the same ligand. In different words, atoms are often assumed to be labelled according to their position in a coordinate file (or data structure) and they are paired according to such label. This assumption breaks down when such labels are not conserved—i.e. the order of atoms is different in the two structures being compared—and/or for symmetric molecules. In the case of symmetric molecules, different binding poses can be chemically identical but different in terms of atom-atom mapping. Since molecular connectivity is naturally represented by graphs (atoms as vertices and bonds as edges), tools from graph theory can be used to obtain the correct atom-atom mapping for two different conformers of the same molecule, thus avoiding the problems outlined above.
Here we present a new Python tool, spyrmsd, for the calculation of symmetry-corrected RMSDs based on graph isomorphisms.
Implementation
spyrmsd is implemented in pure Python and therefore it is easy to integrate in existing Python libraries and Python pipelines, particularly common in cheminformatics and machine learning projects. In this section we describe the implementation of the different types of RMSD calculations implemented in spyrmsd, their use, and their shortcomings.
Standard RMSD
Let us call and the matrices of atomic coordinates of two conformers A and B of the same molecule. The standard RMSD is simply defined as
If we define the displacement —where is the i-th row of and is the i-th row of —the standard RMSD can be written more compactly as
1 |
This simple formula, which assumes the atomic coordinates to be provided in the same order for both conformers, is easy to compute. In spyrmsd the calculation of is vectorised using numpy [4] for speed.
A serious drawback of standard RMSD calculations is that they do not take into account molecular symmetry. This is problematic since atoms of the same specie are intrinsically indistinguishable and therefore symmetry operations conserve molecular properties. Figure 1 shows the atom–atom mapping for benzene with and without symmetry correction after a mirror operation; it is clear that a simple positional atom–atom mapping leads to artificially inflated results. Symmetry corrections would lead to the correct result expected with indistinguishable atoms.
Quaternion characteristic polynomial method
Standard RMSD calculations take into account the possible translations between the two conformers. In order to measure conformational similarity—and neglect translations—the RMSD can be computed on optimally superimposed structures. This minimised RMSD for a pair of molecules can be computed efficiently using the quaternion characteristic polynomial (QCP) method [5], which circumvent the need of finding orthogonal (rigid-body) rotations and special considerations for edge cases.
The QCP methods is based on the calculation a symmetric key matrix [5]
where
The minimum RMSD is then given by [5]
2 |
where , and is the maximum eigenvalue of . The eigenvalues of can be obtained by finding the roots of the characteristic polynomial [6], where is the identity matrix. For the matrix the characteristic polynomial is given by [5]
where (with ), and . The largest characteristic polynomial root can be efficiently computed using the Newton-Raphson method [7] starting from the initial guess [5].
Care should be taken when the two molecules A and B overlap perfectly. In such case, and therefore the term in Eq. (2) can become negative due to numerical errors.
In spyrmsd the solution of the characteristic polynomial equation is based on the Newton–Raphson method implemented in scipy [8] while other vector and matrix operations are vectorised using numpy.
Hungarian algorithm for symmetry correction
The Hungarian algorithm [9, 10] is an algorithm to solve the linear sum assignment problem [11] (also known as minimum weight matching in bipartite graphs) and has been previously proposed as a method to introduce symmetry corrections in RMSD calculations [12]. If is the matrix of squared pairwise distances between all atoms of the conformer A to all atoms of the conformer B, the linear weight assignment problem consists in finding the assignment matrix that minimises the assignment cost , where if and only if atom i of conformer A is assigned to atom j of conformer B. The RMSD computed using the Hungarian algorithm is therefore given by
under the constraint that each row is assigned to exactly one column and each column to exactly one row. This definition is however problematic, since the solution of the assignment problem could end up pairing atoms of different elements. In order to avoid this drawback, the assignment problem is solved for every element separately [12]. If is the matrix of squared pairwise distances between atoms of element e of conformer A to atoms of element e of conformer B, the RMSD computed using the Hungarian algorithm is given by
where if and only if atom i of element e in conformer A is assigned to atom j of element e in conformer B and were the sum on e runs over all elements of the molecule.
Even when the Hungarian algorithm is used to assign atoms of the same element, problems can arise from the fact that the algorithm is not aware of the overall molecular structure. This could result in unphysical assignments which break the molecular graph and result in artificially low RMSD values [13]. Figure 2 shows a simple situation where unphysical assignments arise and lead to a RMSD value lower than the correct one.
Graph isomorphisms for symmetry correction
In order to overcome the problems of the Hungarian algorithm, tools from graph theory can be borrowed to perform an optimal atom-atom assignment based on graph isomorphisms. A molecule can be represented as a graph —hereafter referred to molecular graph—where the vertices V are associated to atoms and the edges E are associated to bonds. If two conformers A and B are represented by graphs and , respectively, the mapping of atoms of molecule A to atoms of molecule B becomes a graph isomorphism problem. An isomorphism between graphs and is a bijective mapping of the vertices of graph A to vertices of graph B that preserves the edge structure of the graphs (molecular connectivity in the case of molecular graphs). With the bijective mapping connecting vertices (atoms) of to vertices (atoms) of the RMSD between the two molecules can be computed using the standard RMSD formulation of Eq. (1).
spyrmsd can leverage networkx [14] or graph-tool [15] for graph representation and graph matching. All possible graph isomorphisms are computed using the VF2 algorithm [16] and the lowest RMSD among all isomorphisms is retained.
The graph isomorphism problem is a non-polynomial (NP) problem and therefore symmetry-corrected RMSD calculations are only suited for small to medium sized molecules. In order to improve speed, graph isomorphisms are cached by default when computing the RMSD between multiple conformations of the same molecule.
API
The main module of spyrmsd is the rmsd module, where all the high-level RMSD functions are implemented. The following functions are available to the user:
rmsd for the computation of the standard RMSD,
hrmsd for the computation of RMSD using the Hungarian algorithm,
symmrmsd for the computation of symmetry-corrected RMSD,
symmrmsd should always be used for small molecules, in order to get the right symmetry-corrected RMSD. rmsd is provided to compute the standard RMSD when symmetry does not play a role (or when the molecular graph is too large to efficiently apply symmetry-corrections) and atoms are listed in the same order. hrmsd is provided for comparison with existing implementations and should not be used otherwise, because of the problems outlined above.
The minimum RMSD (computed using the QCP method [5]) can be obtained with the keyword minimize=True, with and without symmetry-corrections.
spyrmsd is designed to be easily integrated in existing Python libraries or pipelines. For this reason the application programming interface (API) is minimalistic: only atomic coordinates and atomic numbers (rmsd and hrmsd), and molecular adjacency matrices (symmrmsd) have to be passed to RMSD functions in the form of numpy arrays. This simple API makes spyrmsd completely agnostic of the way molecules are stored in different software, as long as they can provide the minimal information required.
Standalone RMSD tool
spyrmsd also offers a standalone RMSD tool as a command line interface (CLI) exposing the functionality of the rmsd and symmrmsd functions. The hrmsd function is not exposed in the CLI, to avoid erroneous calculations.
In the standalone tool, molecular input is handled by OpenBabel [17] (via its Python interface pybel [18]) or RDKit [19]. Such packages are also responsible for building the adjacency matrix representing the molecular graph.
OpenBabel’s own RMSD calculation tool, obrms, is expected to be faster than spyrmsd as a standalone tool since it does not have any Python overhead.
Results
Testing
In order to test the correctness of spyrmsd against OpenBabel’s obrms we redocked the PDBbind refined set [20, 21] with smina [22] to generate different ligand conformations. Ligand SDF files were downloaded directly from the PDB [23] in order to avoid problems with the connectivity present in the original PDBbind dataset. smina was run using the default settings with protein PDB files stripped of water molecules. The top 10 binding poses were retained, resulting in a total of 40,439 different conformations. The RMSD of each docked pose with respect to the crystal pose was computed using symmetry-corrected RMSD with and without minimisation (using the QCP method [5]).
Figures 3 and 4 show the relationship between RMSDs obtained with spyrmsd and obrms with and without minimisation. The RMSDs computed with the two softwares correlates perfectly (Spearman’s correlation coefficient of 1.00) and present a maximum absolute error of . This gives us great confidence that the two independent implementations are equivalent (Additional file 1).
A comparison between the Hungarian method and symmetry-corrected RMSD obtained via graph isomorphisms is presented in Fig. 5. As previously pointed out, the Hungarian method can result in assignments incompatible with the molecular connectivity and therefore leads to artificially low RMSD values [13]. Therefore, the hrmsd function is provided only for comparison with existing software and should not be used otherwise.
Speed
By design, spyrmsd is written fully in Python and leverages fast libraries that are easy to install (using the pip or conda package managers). This means that there is some overhead compared to the most efficient implementations in other compiled libraries.
Figure 6 shows a speed comparison between spyrmsd and obrms for 100 randomly selected systems. Error bars are obtained by repeating the measurements 25 times. spyrmsd is usually comparable or an order of magnitude slower than obrms. This is expected since Python comes with some overhead compared to compiled code. The difference between the graph-tool and networkx backends is more difficult to elucidate: graph-tool seems to be generally slightly faster, but networkx has clearly more variation from system to system (see Fig. 7).
Benchmarking was performed on an Apple MacBook Pro (macOS 10.15) with a 2.6 GHz 6-Core Intel Core i7 processor and 32 GB of 2400 MHz DDR4 memory (Additional file 2).
Discussion
Despite being somewhat slower than other state-of-the-art tools for RMSD calculation, we believe that spyrmsd could be extremely useful to the community: it is a lightweight tool with focussed functionality, it is easy to use and integrate in existing Python codebases and pipelines, and it is easy to install via popular package managers.
Easy installation
spyrmsd is available on the Python Package Index (PyPI) [24] and via the conda package manager [25] on the conda-forge channel [26]. This provides easy cross-platform installation of spyrmsd and all its dependencies to work as a library (with networkx). On macOS and Linux, users can get some speed improvement by installing graph-tool, which is also available via the conda package manager.
In order to use spyrmsd as a standalone tool, users will have to install either OpenBabel or RDKit with their preferred installation method.
Easy integration in existing libraries
spyrmsd is easy to integrate into existing pipelines thanks to its clean and simple API. Standard RMSD calculations require atomic coordinates and atomic numbers only, while symmetry-corrected RMSD calculations also require adjacency matrices in order to compute graph isomorphisms. Atomic coordinates and atomic numbers are usually readily available in most Python libraries dealing with molecular file formats, while the adjacency matrix of a molecule is easy to build from bond connectivity.
We believe that the simple API will favour the integration of spyrmsd in many existing libraries, bringing symmetry-corrected RMSD calculations to widely used packages.
Software best practices
The development of spyrmsd is based on modern software engineering best practices. The code is version-controlled using git [27] and it is freely available on GitHub (https://github.com/RMeli/spyrmsd) [28], released under the open-source and permissive MIT license.
The code is extensively tested using pytest [29]. Tests are run automatically every time a new version of the code is pushed to GitHub thanks to Travis-CI bindings for continuous integration [30]. The code coverage of the test suite is reported on Codecov [31], which provides easy-to-read reports. A code coverage of 100% is targeted, so that all lines of code are executed at least once during tests.
The code is compatible with Python 3.6 or above. Static analysis tools are constantly applied to the code in order to catch errors that would be otherwise missed or discovered only during execution. We use mypy to perform static checks [32] and flake8 to detect style and formatting issues. Such tools help maintaining correctness and stability for future developments as well as a clean codebase.
Finally, the code is documented using Python docstrings and the documentation is built automatically using sphinx [33]. This will likely make it easier to fully understand the codebase thus facilitating the adoption of spyrmsd by other libraries.
Conclusion
spyrmsd provides robust symmetry-corrected RMSD calculations with a clean and simple API that is easy to integrate in existing Python libraries and pipelines. We believe that such a tool could be useful to the wider community of molecular modellers and cheminformaticians.
Future development of the software will focus on improved automatic bond perception (to automatically build molecular adjacency matrices) and speed.
Availability and requirements
Project name: spyrmsd
Operating systems: Linux, macOS, Windows
Programming language: Python
Other requirements: Python 3.6 or higher
License: MIT
Supplementary information
Acknowledgements
This work was supported by funding from the Biotechnology and Biological Sciences Research Council (BBSRC) [BB/MO11224/1] National Productivity Investment Fund (NPIF) [BB/S50760X/1] and Evotec (UK) via the Interdisciplinary Biosciences DTP at the University of Oxford. The authors acknowledge fruitful interactions with Dr. Irfan Alibay.
Authors' contributions
RM wrote spyrmsd. RM and PCB prepared the manuscript. Both authors read and approved the final manuscript.
Data availibility statement
spyrmsd is available for download on PyPI and conda-forge, while the source code is available on GitHub under the MIT license: https://github.com/RMeli/pyrmsd. The code used to produce the figures is included in the supplementary information. The results of docking are available on Zenodo (https://doi.org/10.5281/zenodo.3747315).
Competing interests
The authors declare that they have no competing interests.
Footnotes
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary information accompanies this paper at 10.1186/s13321-020-00455-2.
References
- 1.Mukherjee S, Balius TE, Rizzo RC. Docking validation resources: protein family and ligand flexibility experiments. J Chem Inf Model. 2010;50:1986–2000. doi: 10.1021/ci1001982. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Leung S, Bodkin M, von Delft F, Brennan P, Morris G. Sucos is better than rmsd for evaluating fragment elaboration and docking poses. ChemRxiv. 2019 doi: 10.26434/chemrxiv.8100203.v1. [DOI] [Google Scholar]
- 3.O’Boyle NM, Vandermeersch T, Flynn CJ, Maguire AR, Hutchison GR. Confab—systematic generation of diverse low-energy conformers. J Cheminf. 2011;3(1):8. doi: 10.1186/1758-2946-3-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.van der Walt S, Colbert SC, Varoquaux G. The numpy array: A structure for efficient numerical computation. Comput Sci Eng. 2011;13:98. doi: 10.1109/MCSE.2011.67. [DOI] [Google Scholar]
- 5.Theobald DL. Rapid calculation of rmsds using a quaternion-based characteristic polynomial. Acta Cryst A. 2005;61:478–480. doi: 10.1107/S0108767305015266. [DOI] [PubMed] [Google Scholar]
- 6.Roman S. Advanced linear algebra. Berlin: Springer; 2007. [Google Scholar]
- 7.Quarteroni A, Saleri F. Numerical mathematics. Berlin: Springer; 2007. [Google Scholar]
- 8.Virtanen P, Gommers R, Oliphant TE, Haberland M, Reddy T, Cournapeau D, Burovski E, Peterson P, Weckesser W, Bright J, van der Walt SJ, Brett M, Wilson J, Jarrod Millman K, Mayorov N, Nelson ARJ, Jones E, Kern R, Larson E, Carey CJ, Polat I, Feng Y, Moore EW. Scipy 1.0–fundamental algorithms for scientific computing in python. Nat Methods. 2019;17:261–272. doi: 10.1038/s41592-019-0686-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Kuhn HW. The hungarian method for the assignment problem. Nav Res Logist Q. 1955;2:83–97. doi: 10.1002/nav.3800020109. [DOI] [Google Scholar]
- 10.Munkres J. Algorithms for the assignment and transportation problems. J Soc Indus Appl Math. 1957;5:32–38. doi: 10.1137/0105003. [DOI] [Google Scholar]
- 11.Ignazio J, Cavalier TM. Linear programming. New York: Prentice-Hall; 1994. [Google Scholar]
- 12.Allen WJ, Rizzo RC. Implementation of the hungarian algorithm to account for ligand symmetry and similarity in structure-based design. J Chem Inf Model. 2014;54:518–529. doi: 10.1021/ci400534h. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Bell EW, Zhang Y. Dockrmsd: an open-source tool for atom mapping and rmsd calculation of symmetric molecules through graph isomorphism. J Cheminf. 2019;11:9. doi: 10.1186/s13321-019-0362-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Hagberg AA, Schult DA, Swart PJ (2008) Exploring network structure, dynamics, and function using networkx. In: Proceedings of the 7th Python in Science Conference. p. 11–5
- 15.graph-tool: Efficient network analysis. https://graph-tool.skewed.de/
- 16.Cordella LP, Foggia P, Sansone C, Vento M. A (sub)graph isomorphism algorithm for matching large graphs. IEEE Trans Pattern Anal Mach Intell. 2004;26:1367–1372. doi: 10.1109/TPAMI.2004.75. [DOI] [PubMed] [Google Scholar]
- 17.O’Boyle NM, Banck M, James CA, Morley C, Vandermeersch T, Hutchison GR. Open babel: An open chemical toolbox. J Cheminf. 2011;33:121. doi: 10.1186/1758-2946-3-33. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.O’Boyle NM, Morley C, Hutchison GR (2008) Pybel: a python wrapper for the openbabel cheminformatics toolkit. Chem Cent J 2, 5 [DOI] [PMC free article] [PubMed]
- 19.Rdkit: Open-source cheminformatics software. http://www.rdkit.org/
- 20.Wang R, Fang X, Lu Y, Wang S. The pdbbind database: collection of binding affinities for protein-ligand complexes with known three-dimensional structures. J Med Chem. 2004;47:2977–2980. doi: 10.1021/jm030580l. [DOI] [PubMed] [Google Scholar]
- 21.Liu Z, Li Y, Han L, Li J, Liu J, Zhao Z, Nie W, Liu Y, Wang R. Pdb-wide collection of binding data: current status of the pdbbind database. Bioinformatics. 2014;31:405–412. doi: 10.1093/bioinformatics/btu626. [DOI] [PubMed] [Google Scholar]
- 22.Koes DR, Baumgartner MP, Camacho CJ. Lessons learned in empirical scoring with smina from the csar 2011 benchmarking exercise. J Chem Inf Model. 2013;58:1893–1904. doi: 10.1021/ci300604z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.RCSB protein data bank. https://www.rcsb.org/
- 24.PyPI: Find, install and publish python packages with the python package index. https://pypi.org/
- 25.conda: Package, dependency and environment management for any language. https://conda.io/en/latest/
- 26.conda-forge: A community-led collection of recipes, build infrastructure and distributions for the conda package manager. https://conda-forge.org/
- 27.Chacon S, Straub B (2014) Pro git. Apress
- 28.GitHub. https://github.com/
- 29.Krekel H, Oliveira B, Pfannschmidt R, Bruynooghe F, Laugher B, Bruhin F (2014) pytest. https://github.com/pytest-dev/pytest
- 30.Travis CI. https://travis-ci.org/
- 31.Codecov. https://codecov.io/
- 32.mypy: Optional static typing for python. http://mypy-lang.org/
- 33.Sphinx: Python documentation generator. https://www.sphinx-doc.org/
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
spyrmsd is available for download on PyPI and conda-forge, while the source code is available on GitHub under the MIT license: https://github.com/RMeli/pyrmsd. The code used to produce the figures is included in the supplementary information. The results of docking are available on Zenodo (https://doi.org/10.5281/zenodo.3747315).