Principal Component Analysis on a series of molecular geometries (e.g., a reaction coordinate or trajectory) provides maximum structural variance in the fewest dimensions, and so can offer an objective, comprehensible depiction of the transformation.
Abstract
Most chemical transformations (reactions or conformational changes) that are of interest to researchers have many degrees of freedom, usually too many to visualize without reducing the dimensionality of the system to include only the most important atomic motions. In this article, we describe a method of using Principal Component Analysis (PCA) for analyzing a series of molecular geometries (e.g., a reaction pathway or molecular dynamics trajectory) and determining the reduced dimensional space that captures the most structural variance in the fewest dimensions. The software written to carry out this method is called PathReducer, which permits (1) visualizing the geometries in a reduced dimensional space, (2) determining the axes that make up the reduced dimensional space, and (3) projecting the series of geometries into the low-dimensional space for visualization. We investigated two options to represent molecular structures within PathReducer: aligned Cartesian coordinates and matrices of interatomic distances. We found that interatomic distance matrices better captured non-linear motions in a smaller number of dimensions. To demonstrate the utility of PathReducer, we have carried out a number of applications where we have projected molecular dynamics trajectories into a reduced dimensional space defined by an intrinsic reaction coordinate. The visualizations provided by this analysis show that dynamic paths can differ greatly from the minimum energy pathway on a potential energy surface. Viewing intrinsic reaction coordinates and trajectories in this way provides a quick way to gather qualitative information about the pathways trajectories take relative to a minimum energy path. 
Given that the outputs from PCA are linear combinations of the input molecular structure coordinates (i.e., Cartesian coordinates or interatomic distances), they can be easily transferred to other types of calculations that require the definition of a reduced dimensional space (e.g., biased molecular dynamics simulations).
1. Introduction
Chemical reaction pathways and structural transformations occurring on hyperdimensional potential energy surfaces (PESs) can be difficult to comprehend due to the high number of degrees of freedom available in most molecular systems. The use of reaction coordinate diagrams and reduced dimensional potential energy surface scans1 (RDPESs) has already demonstrated the utility of viewing chemical reactions in a small number of dimensions. These approximate RDPESs are often made by incrementally varying a small number of geometric features and plotting the values of potential energy as a function of these features to generate a low-dimensional surface. For example, a recent paper by Liu et al. details a method of using RDPESs on which to conduct ab initio molecular dynamics (MD) simulations where the RDPESs were constructed using geometric coordinates “chosen based on the chemical knowledge of the system.”2 In addition to generating RDPESs, similar approaches (e.g., choosing specific bond distances, angles, and dihedrals along the course of trajectories as in ref. 3–8) are often used to plot several MD trajectories, and to carry out free energy sampling [e.g., using methods like umbrella sampling,9 metadynamics,10 boxed molecular dynamics (BXD),11,12 forward flux sampling,13 milestoning,14 all of which require a well-defined reduced dimensional space of collective variables from which to sample]. In general, these sorts of analyses tend to rely heavily on user input, i.e., the person making the surface uses their chemical intuition to pick geometric criteria that will make the analysis useful. However, by inferring the geometric changes most important to a reaction and calculating the energy of structures along those coordinates, one runs the risk of confirming one's own biases, and neglecting potentially important degrees of freedom. 
In a variety of realms, it is therefore useful to have an automated method for generating low-dimensional representations to describe structural changes along molecular pathways that is quantitatively and a priori derived from the input data.
In this article, we outline a dimensionality reduction method incorporating principal component analysis (PCA). PCA is an extremely popular method in various fields: in experimental biology, PCA is used to determine the effects of different gene expressions.15–17 In analytical chemistry, PCA is central in the development of quantitative structure–activity relationship (QSAR) models, of particular utility in the pharmaceutical industry.18–21 Perhaps most closely related to this study is the use of PCA in computational biology, to capture essential motions of a protein in MD simulations.22–24 There are, however, still some key limitations of PCA. First, it is assumed that the relationships between features of the data points are linear. Second, principal components must be orthogonal to one another, so some types of coupled motions may not be well-described (i.e., related to the first point, motions that are coupled in non-linear relationships). Third, because PCA aims to pick principal components along which the variance of the data is maximized, some shapes of the data distribution can end up being described poorly (e.g., two “bands” of data, or stacked “pancakes” of data points).25–27 Despite these limitations, for the applications described herein, PCA does an excellent job of defining a reduced dimensional space without losing too much structural information along the chemical pathways examined, and the issue of capturing non-linear motions can be mitigated by adjusting the representations of molecular structures that are input to PCA. Despite its utility, and although reaction coordinates of small-molecule systems are less susceptible to the aforementioned limitations than those of larger systems, PCA is, as far as we know, not commonly used to visualize small-molecule chemical change.
For computational studies of large biomolecular systems occurring over long timescales, a suitable choice of collective variables is necessary for modelling dynamics, and thus many dimensionality reduction techniques in addition to PCA have been explored in the field. For example, in the realm of Markov state models, many in the computational community have chosen to employ time-lagged (or time-structure based) independent component analysis (TICA)28 rather than PCA. TICA aims to maximize the autocorrelation for a given lag time, rather than the variance, and so is better able to resolve slow timescale events, which is better for capturing the slow dynamics of large molecules like enzymes.26,29 Diffusion maps constitute a dimensionality reduction technique that does not assume the data points to be related linearly, but instead seeks to determine the manifold in which the data live.30–32 For the small-molecule applications discussed below, where we are not considering very large systems occurring over large timescales and particularly because we are focusing on intrinsic reaction coordinates (IRCs) rather than MD trajectories to define a reduced dimensional space, we chose to use PCA in order to determine the optimal reduced dimensional space for these example systems. The methods described herein are provided in an open-source software package named PathReducer, which allows the user to decide whether their system is best described by linear combinations of Cartesian coordinates or squared interatomic distances, and also whether they would like these inputs to be mass-weighted prior to processing. The merits of all options as applied to several example systems are discussed in the results section, below.
In this paper, we have three principal goals. The first is to introduce the application of PCA into the field of small-molecule computational chemistry, where its value may not have been as widely recognized as it has been in computational biology. The second is to show the utility of using PCA to analyze and characterize chemical pathway data. In particular, we show that a variant of PCA in which the input data are squared internal distances can have advantages over the version in which Cartesian coordinates are used. Additionally, by using a reduced dimensional space defined by an IRC and projecting MD trajectory data into this space, one can quickly classify the routes taken by trajectories compared to the minimum energy path. The third objective is to provide our code, PathReducer: an easy-to-use code for computational chemists to reduce the dimensionality of their molecular systems.
2. PathReducer: dimensionality reduction software
The methods described below are freely available in an open source Python package named PathReducer, with further details in the ESI.† While there are many dimensionality reduction packages already available in the scikit-learn33,34 library in Python, the present software is specifically designed to process trajectories of small molecules and generate visualizations thereof. The RMSD Python package, which calculates the RMSD between structures and does alignments using a variety of possible methods, was also utilized in the making of this code for structural alignments using the Kabsch algorithm.35 A flowchart illustrating how PathReducer works is shown in Fig. 1.
2.1. Input
PathReducer takes as input the following:
(a) A series of molecular geometries (e.g., an IRC, a trajectory, a relaxed potential energy surface scan) in xyz file format;
(b) ndim, the number of dimensions for the low-dimensional space (often two or three dimensions would be most useful for visualization);
(c) Whether the user wants PCA to be carried out on mass-weighted input coordinates;
(d) Optional labels of four atoms surrounding a stereogenic center of the molecule in order to define chirality (this is only necessary when defining the molecular structures as squared interatomic distance matrices, discussed in more detail below);
(e) The representation of the IRC/trajectory upon which the user wants to perform PCA. The user can specify that PCA be performed on the aligned Cartesian coordinates of the structures (keyword “Cartesians”) or on the upper triangle of the squared interatomic distance matrices of the structures (keyword “Distances”).
The full distance matrix representation is less suitable for very large systems, as the size of the representation scales as N(N − 1)/2, with N being the number of atoms, whereas the aligned Cartesian coordinate representation scales as 3N. Using internal distances, however, provides a more accurate reduced dimensional representation in fewer dimensions when non-linear motions (e.g., torsions) are involved in the reaction pathway. Additionally, the output from using interatomic distance matrices as input to PCA is more suitable for use in free energy sampling methods since the representation is rotationally and translationally invariant.
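As a rough illustration of the input and of how the two representations grow with system size, the following sketch reads a multi-frame xyz stream and reports the length of each representation. The function names `read_xyz_frames` and `representation_lengths` and the two-frame water-like geometry are hypothetical, chosen only for this example; they are not taken from PathReducer.

```python
import io


def read_xyz_frames(handle):
    """Parse a multi-frame xyz stream into (symbols, list of coordinate frames)."""
    frames, symbols = [], None
    while True:
        header = handle.readline()
        if not header.strip():
            break  # end of stream (or a trailing blank line)
        n_atoms = int(header)
        handle.readline()  # comment line of the frame, ignored here
        atoms, coords = [], []
        for _ in range(n_atoms):
            parts = handle.readline().split()
            atoms.append(parts[0])
            coords.append([float(v) for v in parts[1:4]])
        symbols = symbols or atoms  # keep the atom labels of the first frame
        frames.append(coords)
    return symbols, frames


def representation_lengths(n_atoms):
    """Length of one structure's vector in each representation."""
    return {
        "Cartesians": 3 * n_atoms,
        "Distances": n_atoms * (n_atoms - 1) // 2,
    }


# Made-up two-frame example, for illustration only.
example = """3
frame 1
O 0.000 0.000 0.000
H 0.757 0.586 0.000
H -0.757 0.586 0.000
3
frame 2
O 0.000 0.000 0.000
H 0.800 0.600 0.000
H -0.800 0.600 0.000
"""
symbols, frames = read_xyz_frames(io.StringIO(example))
lengths = representation_lengths(len(symbols))
```

For a 50-atom system the same helper gives 150 Cartesian components against 1225 pairwise distances, which is the quadratic growth referred to above.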
2.2. Pre-processing
Both methods have the option to mass-weight the Cartesian coordinates prior to processing by PCA, but mass-weighting must occur after structural alignment. If the specified input is “Cartesians”, the Cartesian coordinates of the structures are represented as 3N-dimensional vectors and aligned using the Kabsch algorithm (step 1C in Fig. 1).36 If the user chooses to mass-weight, the Cartesian coordinates are at this point transformed according to the following equation:
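$$\xi = \left(\sqrt{m_1}\,x_1,\ \sqrt{m_1}\,y_1,\ \sqrt{m_1}\,z_1,\ \ldots,\ \sqrt{m_N}\,x_N,\ \sqrt{m_N}\,y_N,\ \sqrt{m_N}\,z_N\right), \tag{1}$$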
where ξ is the 3N-dimensional vector containing the mass-weighted coordinates for a single structure along the IRC/trajectory, mN is the mass of atom N, and N is the number of atoms in the system (MW step in Fig. 1). If the specified input is “Distances”, rather than using the 3N-dimensional aligned Cartesian coordinate vectors to represent each structure along the IRC, each structure is represented as a squared internal distance matrix with each element representing the squared Euclidean distance between an atom pair of the molecule, generating an (N × N)-dimensional distance matrix for each input structure (step 1D in Fig. 1). Because each interatomic distance matrix is symmetric with its diagonal elements being zero, the upper triangle of each matrix can be flattened to a vector of length N(N − 1)/2 containing all of the pairwise distances.
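The two pre-processing routes can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not PathReducer's own code: the function names are hypothetical, the structures are assumed to be already aligned where that matters, and the input is assumed to be an array of shape (n, N, 3).

```python
import numpy as np


def mass_weight(coords, masses):
    """Scale each atom's x, y, z by sqrt(mass); returns an (n, 3N) array."""
    coords = np.asarray(coords, dtype=float)          # (n, N, 3)
    root_m = np.sqrt(np.asarray(masses, dtype=float))  # (N,)
    weighted = coords * root_m[None, :, None]
    return weighted.reshape(coords.shape[0], -1)


def squared_distance_vectors(coords):
    """Flattened upper triangle of each squared distance matrix.

    Returns an (n, N(N-1)/2) array, one row per structure.
    """
    coords = np.asarray(coords, dtype=float)              # (n, N, 3)
    diff = coords[:, :, None, :] - coords[:, None, :, :]  # (n, N, N, 3)
    d2 = np.sum(diff ** 2, axis=-1)                       # (n, N, N)
    rows, cols = np.triu_indices(coords.shape[1], k=1)    # strictly upper triangle
    return d2[:, rows, cols]


# One made-up three-atom structure, for illustration only.
frame = np.array([[[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]])
xi = mass_weight(frame, masses=[4.0, 1.0, 9.0])
d2_vec = squared_distance_vectors(frame)
```

The distance route needs no alignment step because squared interatomic distances do not change when a structure is rotated or translated.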
2.3. Processing
The data processing step (step 2 in Fig. 1) involves performing PCA on the [n × 3N]- or [n × N(N − 1)/2]-dimensional matrix of structures, n being the number of structures from the input xyz file. Because PCA is well-described in the literature,25,27 we will only give a brief summary of the method here. PCA takes a set of n observations with p variables (in our case, n structures along an IRC/trajectory with 3N Cartesian coordinates or N(N − 1)/2 interatomic distances) and returns an orthogonal basis that maximizes the variance captured by the minimum number of principal components. This transformation is accomplished by a diagonalization of the mean-centered covariance matrix C to generate a new orthogonal coordinate system as follows:
$$\Lambda_C = U_C\,C\,U_C^{T}, \tag{2}$$
where $U_C$ is the matrix of eigenvectors, each of which represents a new coordinate that corresponds to a linear combination of the original variables, and $\Lambda_C$ is the diagonal matrix of the corresponding eigenvalues ($\lambda_C$) of C. In this case, the principal components are linear combinations of Cartesian coordinates or squared interatomic distances. Each eigenvalue gives the amount of the total variance of the system that is captured by its eigenvector. What is often referred to as the “goodness of fit” (G.o.F.) or the “variance explained” by the reduced dimensional model corresponds to the sum of the eigenvalues of the eigenvectors used in the reduced dimensional space divided by the sum of all eigenvalues (that is, the fraction of variance captured by the ndim principal components chosen):37
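$$\mathrm{G.o.F.} = \frac{\sum_{i=1}^{\mathrm{ndim}} \lambda_{C,i}}{\sum_{i=1}^{p} \lambda_{C,i}} \tag{3}$$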
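The processing step can be sketched as a direct diagonalization of the covariance matrix followed by the variance-fraction calculation. This is a minimal NumPy sketch, assuming the rows of `X` are the flattened structures; the function name `pca_fit` is hypothetical, and the package itself may obtain the same quantities through a library PCA routine instead.

```python
import numpy as np


def pca_fit(X, ndim):
    """PCA by eigendecomposition of the covariance matrix.

    X is an (n, p) array of n structures with p variables each.
    Returns the mean structure, the (ndim, p) weights W, the (n, ndim)
    scores T, all eigenvalues (descending), and the goodness of fit.
    """
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    centered = X - mean
    C = np.cov(centered, rowvar=False)       # (p, p) covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)     # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]        # re-sort, largest variance first
    eigvals = eigvals[order]
    W = eigvecs[:, order[:ndim]].T           # rows are principal components
    T = centered @ W.T                       # projection of each structure
    gof = eigvals[:ndim].sum() / eigvals.sum()
    return mean, W, T, eigvals, gof
```

Calling `pca_fit(X, 2)` on a matrix of squared-distance vectors would return the two-dimensional path in `T` and the fraction of variance it captures in `gof`.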
2.4. Reconstruction
The reduced-dimensional IRC/trajectory can then be transformed back into the original, full-dimensional space to reconstruct the effect of individual principal components on the molecular geometries using the following expression (step 3 in Fig. 1):
$$\tilde{X} = T_i \cdot W_i + \bar{X}, \tag{4}$$
where $\tilde{X}$ is the [n × 3N]- or [n × N(N − 1)/2]-dimensional matrix of reduced dimensional structures transformed into the original, full-dimensional space, $T_i$ is the [n × 1]-dimensional matrix of structures represented by the ith principal component, $W_i$ is the [1 × 3N]- or [1 × N(N − 1)/2]-dimensional matrix corresponding to weights of the ith principal component, and $\bar{X}$ is the mean structure of the original dataset. Similarly, the following expression is used to reconstruct the combined effect of the ndim principal components:
$$\tilde{X} = T_{i:\mathrm{ndim}} \cdot W_{i:\mathrm{ndim}} + \bar{X}, \tag{5}$$
where $T_{i:\mathrm{ndim}}$ is the [n × ndim]-dimensional matrix of structures represented by all ndim principal components and $W_{i:\mathrm{ndim}}$ is the [ndim × 3N]- or [ndim × N(N − 1)/2]-dimensional matrix containing the weights of the ndim principal components.
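Both reconstruction expressions reduce to a matrix product plus the mean structure. The sketch below illustrates this with NumPy; the helper names are hypothetical, and the scores and weights are obtained here from a singular value decomposition of the centered data, which yields the same principal components as diagonalizing the covariance matrix.

```python
import numpy as np


def fit(X, ndim):
    """Return the mean structure, (ndim, p) weights W and (n, ndim) scores T."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    # Rows of Vt are the principal components, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    W = Vt[:ndim]
    T = (X - mean) @ W.T
    return mean, W, T


def reconstruct_single(T, W, mean, i):
    """Effect of principal component i alone in the full-dimensional space."""
    return np.outer(T[:, i], W[i]) + mean


def reconstruct_all(T, W, mean):
    """Combined effect of all retained principal components."""
    return T @ W + mean
```

Writing one file per call to `reconstruct_single` and one for `reconstruct_all` would correspond to the (ndim + 1) structure files described in the output section.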
If using a “Cartesians” input to PCA, this is the last step prior to generating output because the reconstructed structures are in Cartesian space. In the case of using the “Distances” input, the structures that have been transformed into reduced dimensional space at this point are still vectors representing the upper triangle of interatomic distance matrices, and so each row then needs to be converted from squared distances to Cartesian coordinates.38 These steps represent the most computationally expensive part of the procedure, as a matrix diagonalization must be done for each molecular structure (step 7D in Fig. 1). The reconstruction of Cartesian coordinates from the flattened, reduced dimensional distance matrices requires the following: each vector is converted back into a square, symmetric matrix with zeroes along the diagonal (step 4D in Fig. 1). The Gram matrix, G, for each internal distance matrix is then calculated by:
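$$G = -\tfrac{1}{2}\left(D - \mathbf{1}\,d_1^{T} - d_1\,\mathbf{1}^{T}\right), \tag{6}$$
where D represents the interatomic distance matrix, d1 is the first column of D, and $\mathbf{1}$ is the N-dimensional column vector of ones (step 5D in Fig. 1). An eigenvalue decomposition (EVD) is then conducted on G (step 6D in Fig. 1) as follows: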
$$\Lambda_G = U_G\,G\,U_G^{T} \tag{7}$$
The approximate reconstruction of the Cartesian coordinates is given by the first three columns of the matrix generated by taking the product of the eigenvectors and the square root of their corresponding eigenvalues, $\Lambda_G^{1/2} U_G^{T}$. It should be noted that because the reduced dimensional distance matrix, D, is not a true distance matrix, but rather what is referred to as a “predistance matrix”,39 there will be trailing values in the reconstruction matrix $\Lambda_G^{1/2} U_G^{T}$ beyond the first three columns that are a result of the fact that some structural information is lost by reducing the dimensionality of the system. If D were a true distance matrix, only the first three columns of $\Lambda_G^{1/2} U_G^{T}$ would be nonzero. Additionally, because information about the absolute rotational/reflective configuration is also lost in representing each of the structures as internal distance matrices, these structures will be in an arbitrary rotational/reflective configuration. For the sake of visualization, the Kabsch algorithm,36 which determines the optimal rotation matrix to minimize RMSD between pairs of points, is used to align structures along the IRC.
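The conversion from a squared distance matrix back to Cartesian coordinates can be sketched as below. This is a minimal NumPy illustration that assumes the common Gram-matrix construction with the first atom placed at the origin; the function name is hypothetical, and negative eigenvalues (which can appear for a predistance matrix) are simply clipped to zero here, which is a choice made for this sketch.

```python
import numpy as np


def coords_from_squared_distances(D):
    """Recover (N, 3) coordinates, up to rotation/reflection, from an
    (N, N) matrix of squared interatomic distances."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    ones = np.ones((n, 1))
    d1 = D[:, [0]]                                   # first column, shape (N, 1)
    # Gram matrix of the coordinates with atom 1 at the origin.
    G = -0.5 * (D - ones @ d1.T - d1 @ ones.T)
    eigvals, eigvecs = np.linalg.eigh(G)             # ascending eigenvalues
    top = np.argsort(eigvals)[::-1][:3]              # three largest
    lam = np.clip(eigvals[top], 0.0, None)           # guard against small negatives
    # Scale each retained eigenvector by the square root of its eigenvalue.
    return eigvecs[:, top] * np.sqrt(lam)
```

For a true squared distance matrix the recovered coordinates reproduce every pairwise distance exactly; for a reconstructed predistance matrix the discarded eigenvalues measure how much was lost.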
Structures along the reconstructed pathway are reflected if the chirality of the structure at a particular point is not consistent with the analogous structure in the original file (step 7D in Fig. 1). The optional input of four atoms surrounding the stereogenic center is used to determine the chirality of the structure at each point by the method in ref. 40. The sign of the following fourth-order determinant is used to assign the chirality of the structure:
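$$\begin{vmatrix} x_1 & y_1 & z_1 & 1 \\ x_2 & y_2 & z_2 & 1 \\ x_3 & y_3 & z_3 & 1 \\ x_4 & y_4 & z_4 & 1 \end{vmatrix} \tag{8}$$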
where xi, yi, and zi represent the Cartesian coordinates of the four atoms surrounding the stereogenic center. This determinant will only be equal to zero when the four atoms used to assign the molecule's chirality are in the same plane.
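A chirality check of this kind can be sketched as follows. The layout of the determinant (coordinates in the first three columns, a column of ones last) and the tolerance used to detect coplanar atoms are assumptions of this example; only the relative sign between the original and reconstructed structures matters for deciding whether to reflect, so a different column order would flip both signs together.

```python
import numpy as np


def chirality_sign(p1, p2, p3, p4, tol=1e-9):
    """Sign (+1, -1 or 0) of the 4 x 4 determinant built from four atoms."""
    points = np.array([p1, p2, p3, p4], dtype=float)   # (4, 3)
    M = np.column_stack([points, np.ones(4)])          # append a column of ones
    det = np.linalg.det(M)
    if abs(det) < tol:
        return 0  # the four atoms are (numerically) coplanar
    return 1 if det > 0 else -1


def needs_reflection(original_atoms, reconstructed_atoms):
    """True when the reconstructed structure has the opposite handedness."""
    s_orig = chirality_sign(*original_atoms)
    s_new = chirality_sign(*reconstructed_atoms)
    return s_orig != 0 and s_new == -s_orig


# Made-up coordinates, for illustration only.
atoms = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)]
mirrored = [(x, y, -z) for x, y, z in atoms]
flip = needs_reflection(atoms, mirrored)
```

Reflecting a flagged structure through any plane (for example by negating its z coordinates) restores the handedness before the final alignment.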
If coordinates were mass-weighted, mass-weighting of the coordinates is removed according to the following equation (step UMW in Fig. 1):
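$$v_{\mathrm{PC}_i} = \left(\frac{\xi_1}{\sqrt{m_1}},\ \frac{\xi_2}{\sqrt{m_1}},\ \frac{\xi_3}{\sqrt{m_1}},\ \ldots,\ \frac{\xi_{3N}}{\sqrt{m_N}}\right), \tag{9}$$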
where vPCi is the 3N-dimensional vector containing the Cartesian coordinates for a structure along the reaction pathway in PCi, ξj is the jth component of the 3N-dimensional vector containing the mass-weighted coordinates for a single structure along the IRC/trajectory, mN is the mass of atom N, and N is the number of atoms in the system. Finally, structures along the reconstructed pathway are aligned using the Kabsch algorithm (step 8D in Fig. 1).
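The last two steps, removing the mass-weighting and aligning the reconstructed structures, can be sketched as below. The text states that the RMSD Python package was used for the alignment; the `kabsch_align` function here is instead a plain NumPy sketch of the standard Kabsch procedure with hypothetical names, included only to make the operation concrete.

```python
import numpy as np


def remove_mass_weighting(xi, masses):
    """Divide each atom's three components by sqrt(mass); returns (N, 3)."""
    xi = np.asarray(xi, dtype=float).reshape(-1, 3)
    return xi / np.sqrt(np.asarray(masses, dtype=float))[:, None]


def kabsch_align(mobile, reference):
    """Rotate and translate `mobile` (N, 3) onto `reference` (N, 3)."""
    mobile = np.asarray(mobile, dtype=float)
    reference = np.asarray(reference, dtype=float)
    ref_center = reference.mean(axis=0)
    P = mobile - mobile.mean(axis=0)         # centered mobile structure
    Q = reference - ref_center               # centered reference structure
    H = P.T @ Q                              # 3 x 3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    # Force a proper rotation (determinant +1), never a reflection.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return (R @ P.T).T + ref_center
```

Aligning every frame to the first frame, or to its predecessor, are both plausible choices for a pathway; which one the package uses is not specified in this section.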
2.5. Output
PathReducer generates a total of (ndim + 1) xyz files from the Cartesian coordinates of the principal components (PCs): the ndim PCs individually transformed into the full-dimensional space, as well as the combination of all ndim PCs transformed back into the full-dimensional space. These files show the effect of each principal component on the geometries along the trajectory. A plot of the IRC/trajectory in the reduced dimensional space defined by the top two and three PCs is also generated (see below for examples).
3. Applications to chemical systems
To illustrate the output of PathReducer, we show four examples of systems on which we conduct dimensionality reduction. The first two, “malonaldehyde” and “SN2”, are prototypical test systems that have been previously used by Tsutsumi et al. to illustrate their dimensionality reduction approach.41,42 The third is a simple torsional rotation of N2O-appended acrylonitrile. The last example is the opening of substituted cyclopropylidene to generate chiral allenes.43 The results discussed below utilize coordinates that were not mass-weighted. The mass-weighting option is included in case the user wants to define a reduced dimensional space for which the calculated kinetic energy is not dependent on mass. As we were not interested in calculating kinetic energy in our reduced dimensional space, and because some of the systems below include hydrogen movements along the reaction coordinate that we did not want to be dwarfed by the movements of heavy atoms, we chose not to mass-weight the coordinates prior to PCA. Mass-weighting does change the results of the dimensionality reduction, as scaling the data on which PCA is conducted changes the reduced dimensional space. In terms of visualization of the pathway in the reduced dimensional space, mass-weighting will give precedence to the movement of heavier atoms; that is, heavier atoms will contribute more to the structural variance along the chemical pathway, which will be reflected in the PCs. For this reason, care should be taken when deciding whether or not it is appropriate to mass-weight the coordinates prior to PCA. Mass-weighting would not be appropriate, for example, when hydrogen movements play a large role in the chemical pathway. See the ESI† for mass-weighted results for all of the example systems below.
3.1. Quantum mechanical methods for generating IRCs and trajectories
Gaussian 09 (ref. 44) was used to generate the example IRCs shown below. The malonaldehyde, SN2, and cyclopropylidene IRCs were calculated using the MP2 method45 with the 6-31+G(d,p) basis set. The MD trajectories for the SN2 and cyclopropylidene bifurcation systems were calculated using the Born–Oppenheimer Molecular Dynamics (BOMD) functionality in Gaussian 09 at the same level of theory as their IRCs. It should be noted that while ab initio quantum chemistry methods were used to generate IRCs and MD trajectories in this case, this analysis is not specific to a particular type of calculation or level of theory. All that is needed as input to the method is one or more files containing molecular structures in xyz file format illustrating the transformation(s) of interest.
3.2. Malonaldehyde
Intramolecular hydrogen transfer between the two oxygens of malonaldehyde is one of the most studied systems in reaction dynamics, owing to the fact that the reaction coordinate is symmetric about the transition state structure, generating indistinguishable molecules. The IRC for this reaction, as well as reactant, transition state, and product structures, can be seen in Fig. 2.
Fig. 3a and b show the results obtained when PathReducer is used to represent the structures along the malonaldehyde IRC as squared internal distance matrices that are input to PCA. Fig. 3c shows that the first principal component (PC1) describes 87.0% of the variance, while PC2 accounts for 12.8%. As these components capture more than 99% of the total variance in the geometrical changes along the IRC, we conclude that the important molecular motions are captured by this two-dimensional space. Performing PCA on the aligned Cartesian coordinates gives very similar results, which are shown in the ESI.†
Fig. 4 shows that the most significant principal component (PC1) corresponds to motion of the hydrogen atom between the two carbonyl oxygens and alternating single and double bond character of the two C–C bonds. The second most significant principal component (PC2) corresponds predominantly to inward motion of the carbonyl oxygens, where the oxygens are farthest apart in the reactant and product structures and closest together at the transition state structure. For videos of these PCs, see https://vimeo.com/335614575 for PC1 and https://vimeo.com/335614565 for PC2. See the ESI† for corresponding xyz files of these PCs.
3.3. SN2 reaction between OH– and CH3F
Our second example is the SN2 reaction between hydroxide ion and fluoromethane, where hydroxide ion attacks the backside of fluoromethane and releases a fluoride ion (Fig. 5). Modelled in the gas phase, along the IRC, the fluoride ion does not dissociate completely, but rather orbits the newly generated methanol until it finds a suitable location to hydrogen bond with the hydroxyl group. This is not, however, the most common scenario in MD trajectories. Only 10% of MD trajectories conducted by Tsutsumi et al. showed the fluoride ion hydrogen bonding with the resultant methanol, while the other 90% had the fluoride dissociating from the system completely.42
In this system, with squared interatomic distances as input to PCA, PC1 accounted for 78.7% of the variance, PC2 for 14.5%, and PC3 for 4.9% (Fig. 6c, below).
Visualizations of the geometric changes along the top two principal components can be found in Fig. 7. PC1 represents a pathway that looks quite similar to the original IRC, where the fluoride ion dissociates from methanol and then orbits around the molecule to interact with the hydroxyl group. For a video of PC1, see https://vimeo.com/335614633. PC2 represents an almost periodic motion (as can be seen in Fig. 6a, where PC2 starts at a maximum, reaches a minimum, and then returns near to the same maximum) of methyl group pyramidalization and O–H bond stretching. For a video of PC2, see https://vimeo.com/335614625. Corresponding xyz files for these PCs can be found in the ESI.†
New data for a system can also be projected into a defined reduced dimensional space. To illustrate this, an MD trajectory was initiated from the SN2 system's transition state structure (structure B, Fig. 5–7) for 500 steps of 1 fs and propagated in the product direction. As was observed in most of the trajectories calculated by Tsutsumi et al.,42 after dissociating, the fluoride ion did not orbit the resultant methanol and hydrogen bond with the hydroxyl group, but rather dissociated completely and did not re-associate for the duration of the trajectory (500 fs). As can be seen in Fig. 8, there is oscillatory movement with the amplitude in the direction of PC1 and almost linear movement in the direction of PC2. This oscillation reflects the excess energy in the forming C–O bond vibration (reflected in PC1), and progression along PC2 is consistent with the C–F distance increasing. Though this reduced dimensional space was defined only by the structures along the SN2 IRC, it can be quickly seen from the projection of an MD trajectory in the reduced dimensional space that the dynamical path is very different from the IRC path. In addition to showing that MD trajectory paths can be very different from IRC paths, this example illustrates that PathReducer can be used as a straightforward way to classify reaction pathways generated by different types of molecular simulations. Plots of the results when using aligned Cartesian coordinates to represent the molecular structures can be found in the ESI† and look similar to those generated when using squared interatomic distances as input to PCA.
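Projecting new structures into a space defined by another pathway amounts to subtracting the mean of the defining data and multiplying by its principal component weights. The sketch below shows this with NumPy; the function names are hypothetical, and it assumes the new structures use the same representation and atom ordering as the defining pathway (and, for Cartesian input, have been aligned in the same way).

```python
import numpy as np


def fit_space(X_irc, ndim):
    """Define the reduced space from the IRC: mean structure and weights."""
    X_irc = np.asarray(X_irc, dtype=float)
    mean = X_irc.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_irc - mean, full_matrices=False)
    return mean, Vt[:ndim]                   # rows of Vt are the PCs


def project(X_new, mean, W):
    """Project new structures using the IRC's mean and weights.

    No refitting is done, so motion orthogonal to the IRC-defined
    components does not appear in the projection.
    """
    return (np.asarray(X_new, dtype=float) - mean) @ W.T


# Synthetic four-variable "pathway", for illustration only.
t = np.linspace(-1.0, 1.0, 11)
X_irc = np.column_stack([t, t ** 2, np.zeros_like(t), np.zeros_like(t)])
mean, W = fit_space(X_irc, ndim=2)
irc_path = project(X_irc, mean, W)
trajectory_point = project([[0.5, 0.3, 7.0, 0.0]], mean, W)
```

Because the axes come only from the IRC, a trajectory that moves mainly along directions the IRC never samples will look compressed in such a plot, which is worth keeping in mind when comparing paths.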
3.4. Torsions in the N2O–acrylonitrile complex
One of the biggest issues found in this study with using aligned Cartesian coordinates rather than interatomic distances as input to PCA is how poorly non-linear motions (e.g., torsions) are represented in individual principal components. To illustrate this point, we looked at the dihedral rotation around the C–O bond of an N2O–acrylonitrile complex. We chose this system as one that could be interesting to view in reduced dimensions because we posit that this rotation is a geometric feature that could, in principle, discriminate between two possible reactive pathways: epoxidation or 1,3-dipolar cycloaddition (Fig. 9).
Fig. 10 shows the IRC projected onto the reduced dimensional space. In this case, two principal components are enough to describe over 99% of the variance in the system. However, using interatomic distances to represent the structures as input to PCA resulted in the first principal component accounting for 93.3% of the variance in the system, whereas an aligned Cartesian coordinates representation of the structures meant the first principal component only accounted for 82.0% of the variance. This result implies that interatomic distance matrices as input to PCA are better for handling torsions in a smaller number of principal components. Thus, if torsions are suspected to be one of the major types of geometric changes along the course of an IRC or trajectory, using the “Distances” input option is likely a better choice (though, if possible, both methods should be screened).
This point can be illustrated by examining the effects of the top principal components on the geometries along the acrylonitrile scan. When performing PCA on the aligned Cartesian coordinates, PC1 significantly compresses the N2O moiety during the torsion in order to emulate the effect of a dihedral rotation, while this is not the case when using squared interatomic distances. This is particularly evident in the middle frames shown in Fig. 11a and b. Similarly, a squared interatomic distances representation more accurately preserves the bond distances of the N2O moiety (Fig. 11c). See https://vimeo.com/336110236 for a video of PC1 using aligned Cartesian coordinates as input to PCA and https://vimeo.com/335614657 for PC1 using interatomic distances as input to PCA.
3.5. Post-transition state bifurcation in cyclopropylidene ring-opening
The final example to illustrate the utility of this method is a system that exhibits a post-transition state bifurcation.46,47 This particular system is the ring-opening of cyclopropylidene to generate chiral allenes, which follows up on a reaction previously studied by two of us, investigating the effects of explicit solvent on enantiomeric induction. In the previous study, the concerted, asynchronous transition state structure for the ring-opening event was preceded by N2 departure from the carbene carbon, as depicted in Fig. 12.43 The system sans N2 was chosen to focus on the structural changes along the reaction coordinate of the carbon skeleton (including fluorines).
Systems with post-transition state bifurcations occur in cases where a single transition state structure connects a reactant to two separate products, without any intermediate minima or secondary barriers along the downhill path to either product. If one were to take the upper saddle point structure on the PES as the transition state structure and follow the steepest descent path in the reactant and product directions, where two products are related by symmetry (e.g., enantiomers) the steepest descent path on the product side would pass by a valley-ridge inflection (VRI) point before reaching a minimum. In the case of unsymmetrical bifurcations, there would not be a VRI, but still an additional exit channel with no intervening minima or barriers to overcome. In either case, the IRC would not illustrate the connection between the saddle point and the second possible minimum, as, mathematically, there can only be one steepest descent path. We chose this system to test as input to PathReducer because bifurcating reactions represent a class of chemical change whose dynamics are often important, but which have very rarely been visualized using actual structural data and are more often illustrated on qualitative surfaces that illustrate the location of a VRI.46,48–54
While an IRC calculation necessarily picks a single pathway as the minimum energy path, a “bifurcating” IRC could in this case be constructed by a 180° torsion about the C1–C3 axis (see Fig. 13) for each structure following the branching point. Note that reflecting each point along the IRC after the point where the paths split would artificially change the atom labels and would cause the distance matrices for the pathway to each product to be identical, and thus would not be able to show the paths splitting. To avoid this, we keep the atom labels consistent with those that would be obtained by a torsional rotation.
Fig. 14 illustrates that representing structures along symmetric bifurcating reaction paths using interatomic distance matrices does a good job of illustrating the path “splitting” before leading to the two possible products, whose locations are shown by the yellow ends of the paths. The top three principal components account for 77.6%, 11.8%, and 10.0% of the variance in the IRC, respectively. The equivalent plot using the “Cartesians” input to PCA can be found in the ESI.†
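As a rough illustration of how per-component variance percentages like those quoted above can be obtained, the sketch below runs PCA (via a singular value decomposition in NumPy) on flattened interatomic distance matrices. It is an assumption-laden stand-in rather than the PathReducer implementation: the path is synthetic, plain (unsquared, not mass-weighted) distances are used, and the function names `distance_features` and `pca_explained_variance` are hypothetical.

```python
import numpy as np

def distance_features(path):
    """Flatten the upper triangle of each frame's distance matrix.

    path: array of shape (n_frames, n_atoms, 3)
    returns: array of shape (n_frames, n_atoms * (n_atoms - 1) / 2)
    """
    diff = path[:, :, None, :] - path[:, None, :, :]
    dists = np.linalg.norm(diff, axis=-1)
    i, j = np.triu_indices(path.shape[1], k=1)
    return dists[:, i, j]

def pca_explained_variance(features):
    """Return the fraction of total variance captured by each principal component."""
    centered = features - features.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()

# Synthetic stand-in for a reaction path: 50 frames of 4 atoms, in which
# one atom swings through half a turn about the z-axis.
t = np.linspace(0.0, np.pi, 50)
path = np.zeros((50, 4, 3))
path[:, 1] = [1.0, 0.0, 0.0]
path[:, 2] = [0.0, 1.0, 0.0]
path[:, 3, 0] = np.cos(t)
path[:, 3, 1] = np.sin(t)
path[:, 3, 2] = 1.0

ratios = pca_explained_variance(distance_features(path))
print("variance fraction of top three PCs:", ratios[:3])
```

The ratios sum to one over all components, so the sum of the leading few indicates how much structural variance a two- or three-dimensional plot retains.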
As with the SN2 system, MD trajectories for the cyclopropylidene bifurcation were initiated from the transition state structure and propagated in the product direction. Fig. 15 shows these trajectories projected into the reduced dimensional space defined by the bifurcating IRC. The MD trajectories do not follow the IRC path very closely, indicating that dynamic properties of molecules should not be deduced from IRCs alone. Assigning the product made in each case (if a product is even made) is not entirely straightforward, as illustrated in the original trajectory videos (found at https://vimeo.com/336131095 for trajectory A, https://vimeo.com/336131066 for trajectory B, https://vimeo.com/336131042 for trajectory C, and https://vimeo.com/336131137 for trajectory D; see the ESI† for corresponding xyz files). However, projecting these trajectories into the reduced dimensional space defined by the IRC enables rapid qualitative insights into the routes taken by any particular trajectory. Fig. 15a shows a trajectory in which the cyclopropyl ring opens but lingers in the bifurcation region without committing to a clear product pathway. Fig. 15b and c show trajectories that head toward a single product (enantiomer 2). Fig. 15d is rather different: it goes along the pathway toward enantiomer 1 before traversing the region between the two possible products, a consequence of the fact that the trajectories illustrated in Fig. 15 are run in the gas phase at a constant total energy (NVE ensemble). Therefore, once the molecule goes down the potential energy “hill” after the transition state structure, it has significant excess energy with nowhere to dissipate it, which enables interconversion between different product states through high energy geometries.
Seeing MD trajectories projected into a reduced dimensional space defined by an IRC in this way offers a unique perspective on the utility of IRCs compared to MD simulations. While MD trajectories arguably model real, room temperature reactions more accurately by including the effects of finite energy and temperature, this kinetic energy adds noise to the pathway from reactant to product(s). An IRC, however, shows the minimum energy pathway from reactant to product(s); viewed another way, the IRC is the minimum atomic motion necessary for a transformation. In this sense, the IRC provides a sort of “skeleton” characterizing the transformation of interest, which is very useful to aid in product classification of MD trajectories. Defining a reduced dimensional space based on an IRC and projecting MD trajectories into this space offers a simple and efficient way to characterize the pathways of MD trajectories in a quantitative comparison to the IRC.
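The projection step described here can be sketched as follows: the PCA mean and component vectors are determined from the IRC alone and then applied unchanged to trajectory frames. This is a minimal sketch under stated assumptions, not the PathReducer code. The geometries are synthetic, the helper names (`distance_features`, `fit_pca`, `project`) are hypothetical, and the per-frame `deviation` from the nearest IRC point is an illustrative measure that does not appear in the paper.

```python
import numpy as np

def distance_features(frames):
    """Flattened upper-triangle interatomic distances for each frame."""
    diff = frames[:, :, None, :] - frames[:, None, :, :]
    dists = np.linalg.norm(diff, axis=-1)
    i, j = np.triu_indices(frames.shape[1], k=1)
    return dists[:, i, j]

def fit_pca(features, n_components):
    """Return the feature mean and the leading principal axes (rows)."""
    mean = features.mean(axis=0)
    _, _, vt = np.linalg.svd(features - mean, full_matrices=False)
    return mean, vt[:n_components]

def project(features, mean, components):
    """Project features into the space spanned by the given principal axes."""
    return (features - mean) @ components.T

rng = np.random.default_rng(0)

# Synthetic "IRC": 40 frames interpolating between two random 5-atom geometries.
start = rng.normal(size=(5, 3))
end = rng.normal(size=(5, 3))
s = np.linspace(0.0, 1.0, 40)[:, None, None]
irc = (1.0 - s) * start + s * end

# Synthetic "trajectory": the same path with random displacements added.
traj = irc + 0.05 * rng.normal(size=irc.shape)

# The reduced space is defined by the IRC only; the trajectory is projected into it.
mean, comps = fit_pca(distance_features(irc), n_components=2)
irc_2d = project(distance_features(irc), mean, comps)
traj_2d = project(distance_features(traj), mean, comps)

# Illustrative measure: distance from each trajectory frame to the nearest IRC point.
gaps = np.linalg.norm(traj_2d[:, None, :] - irc_2d[None, :, :], axis=-1)
nearest_irc_index = gaps.argmin(axis=1)
deviation = gaps.min(axis=1)
print("largest deviation from the IRC in reduced space:", deviation.max())
```

Reusing the IRC-derived mean and axes for the trajectory is what places both data sets in the same coordinate system, so that trajectory frames can be compared directly with the IRC "skeleton".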
4. Conclusions and future work
In conclusion, we have developed a procedure and written software for dimensionality reduction of reaction pathways that is generalizable and can handle specific chemical problems (e.g., torsions and bifurcations). For several examples, we were able to show that this method can reduce the dimensionality of a complex chemical system to a much smaller number of dimensions. For all of the applications outlined herein, two or three dimensions were sufficient to reconstruct the reaction pathway without losing too much information about the structural variance. The principal components generated as a result of this dimensionality reduction method are linear combinations of (potentially mass-weighted) aligned Cartesian coordinates or interatomic distances. For the example systems described, the interatomic distance representation of structures was better than aligned Cartesian coordinates at describing non-linear structural movements, such as torsions. In the future, we plan to use this methodology to choose collective variables to be used in free energy sampling workflows such as metadynamics or boxed molecular dynamics (BXD).12 We will also analyze various types of trajectories [e.g., MD trajectories incorporating explicit solvent, non-adiabatic MD trajectories, gas-surface scattering MD trajectories, user-generated pathways from interactive molecular dynamics in virtual reality (iMD-VR)]. Finally, we would also like to make the code for this method more efficient in order to be better able to analyze enzyme–substrate systems, as similar methods of describing proteins as internal distance matrices have already been utilized.23,55 Our hope is that PathReducer will prove useful for mapping out reaction pathways, as an alternative to relying on chemical intuition to determine geometric changes that are most important along an IRC or trajectory.
While improvements are ongoing, we are confident in the broad utility of dimensionality reduction of chemical systems and believe it has the potential to form a useful tool for molecular analysis within the whole of the molecular simulation community.
Conflicts of interest
There are no conflicts to declare.
Acknowledgments
SRH would like to thank Jonathan Barnoud, Alex Jamieson-Binnie, and Mike O'Connor for helpful discussions in the process of writing the PathReducer software, as well as Rob Arbon for discussions of dimensionality reduction and PCA. We would also like to acknowledge the following packages: NumPy,56 Matplotlib,57 and VMD.58 All authors acknowledge support of this work through EPSRC grant EP/P021123/1. LAB thanks The Alan Turing Institute under the EPSRC grant EP/N510129/1. DRG acknowledges funding from the Royal Society as a University Research Fellow, and also from EPSRC grant EP/M022129/1.
Footnotes
†Electronic supplementary information (ESI) available: The ESI contains the following input xyz files and PathReducer output xyz files, for mass-weighted and not mass-weighted coordinates and for both “Cartesians” and “Distances” inputs to PCA: (1) input files: (a) malonaldehyde system IRC, (b) SN2 system (i) IRC (ii) MD trajectory, (c) N2O–acrylonitrile system dihedral scan, (d) cyclopropylidene bifurcation system (i) IRC (ii) four MD trajectories (A–D). (2) Output files: (a) malonaldehyde system (i) PC1, PC2, PC3 (ii) PCall (PC1–3 combined), (b) SN2 system (i) PC1, PC2, PC3 for IRC (ii) PCall (PC1–3 combined) for IRC (iii) PC1, PC2, PC3 for MD trajectory (iv) PCall (PC1–3 combined) for MD trajectory, (c) N2O–acrylonitrile system (i) PC1, PC2, PC3 using (ii) PCall (PC1–3 combined), (d) cyclopropylidene bifurcation system (i) PC1, PC2, PC3 for IRC (ii) PCall (PC1–3 combined) for IRC (iii) PC1, PC2, PC3 for four MD trajectories (A–D) (iv) PCall (PC1–3 combined) for four MD trajectories (A–D). Relevant plots for all of these systems and all possible input combinations are also included in the ESI. The input file for doing BOMD simulations in Gaussian 09 is also included. See DOI: 10.1039/c9sc02742d
References
- The type of surface being referred to here can be seen in: (a) Hare S. R., Tantillo D. J. Chem. Sci. 2017;8:1442–1449. doi: 10.1039/c6sc03745c; (b) Hare S. R., Pemberton R. P., Tantillo D. J. J. Am. Chem. Soc. 2017;139:7485–7493. doi: 10.1021/jacs.7b01042; (c) Hare S. R., Li A., Tantillo D. J. Chem. Sci. 2018;9:8937–8945. doi: 10.1039/c8sc02653j.
- Liu C., Kelley C. T., Jakubikova E. J. Phys. Chem. A. 2019;123:4543–4554. doi: 10.1021/acs.jpca.9b02298.
- Xue X. S., Jamieson C. S., Garcia-Borras M., Dong X., Yang Z., Houk K. N. J. Am. Chem. Soc. 2019;141:1217–1221. doi: 10.1021/jacs.8b12674.
- Yang Z., Yu P., Houk K. N. J. Am. Chem. Soc. 2016;138:4237–4242. doi: 10.1021/jacs.6b01028.
- Xu L., Doubleday C. E., Houk K. N. J. Am. Chem. Soc. 2010;132:3029–3037. doi: 10.1021/ja909372f.
- Jimenez-Oses G., Liu P., Matute R. A., Houk K. N. Angew. Chem., Int. Ed. 2014;53:8664–8667. doi: 10.1002/anie.201310237.
- Patel A., Chen Z., Yang Z., Gutierrez O., Liu H. W., Houk K. N., Singleton D. A. J. Am. Chem. Soc. 2016;138:3631–3634. doi: 10.1021/jacs.6b00017.
- Noey E. L., Yang Z., Li Y., Yu H., Richey R. N., Merritt J. M., Kjell D. P., Houk K. N. J. Org. Chem. 2017;82:5904–5909. doi: 10.1021/acs.joc.7b00878.
- Kästner J. Wiley Interdiscip. Rev.: Comput. Mol. Sci. 2011;1:932–942.
- Barducci A., Bonomi M., Parrinello M. Wiley Interdiscip. Rev.: Comput. Mol. Sci. 2011;1:826–843.
- Booth J., Vazquez S., Martinez-Nunez E., Marks A., Rodgers J., Glowacki D. R., Shalashilin D. V. Philos. Trans. R. Soc., A. 2014;372:20130384. doi: 10.1098/rsta.2013.0384.
- O'Connor M., Paci E., McIntosh-Smith S., Glowacki D. R. Faraday Discuss. 2016;195:395–419. doi: 10.1039/c6fd00138f.
- Allen R. J., Valeriani C., Rein Ten Wolde P. J. Phys.: Condens. Matter. 2009;21:463102. doi: 10.1088/0953-8984/21/46/463102.
- Faradjian A. K., Elber R. J. Chem. Phys. 2004;120:10880–10889. doi: 10.1063/1.1738640.
- Lotfi E., Keshavarz A. Comput. Biol. Med. 2014;54:180–187. doi: 10.1016/j.compbiomed.2014.09.008.
- Wongchenko M. J., McArthur G. A., Dreno B., Larkin J., Ascierto P. A., Sosman J., Andries L., Kockx M., Hurst S. D., Caro I., Rooney I., Hegde P. S., Molinero L., Yue H., Chang I., Amler L., Yan Y., Ribas A. Clin. Cancer Res. 2017;23:5238–5245. doi: 10.1158/1078-0432.CCR-17-0172.
- Wang S.-L., Li M. and Wang H., Using 2D Principal Component Analysis to Reduce Dimensionality of Gene Expression Profiles for Tumor Classification, in Bio-Inspired Computing and Applications, ICIC 2011, ed. D. S. Huang, Y. Gan, P. Premaratne and K. Han, Lecture Notes in Computer Science, Berlin, Heidelberg, 2012, vol. 6840.
- Hemmateenejad B., Miri R., Elyasi M. J. Theor. Biol. 2012;305:37–44. doi: 10.1016/j.jtbi.2012.03.028.
- Vieira J. B., Braga F. S., Lobato C. C., Santos C. F., Costa J. S., Bittencourt J. A., Brasil D. S., Silva J. O., Hage-Melim L. I., Macedo W. J., Carvalho J. C., Santos C. B. Molecules. 2014;19:10670–10697. doi: 10.3390/molecules190810670.
- Yoo C., Shahlaei M. Chem. Biol. Drug Des. 2018;91:137–152. doi: 10.1111/cbdd.13064.
- Shahlaei M., Madadkar-Sobhani A., Fassihi A., Saghaie L., Arkan E. Med. Chem. Res. 2011;21:3246–3262.
- Amadei A., Linssen A. B. M., Berendsen H. J. C. Proteins: Struct., Funct., Genet. 1993;17:412–425. doi: 10.1002/prot.340170408.
- Woods C. J., Malaisree M., Pattarapongdilok N., Sompornpisut P., Hannongbua S., Mulholland A. J. Biochemistry. 2012;51:4364–4375. doi: 10.1021/bi300561n.
- Shkurti A., Goni R., Andrio P., Breitmoser E., Bethune I., Orozco M., Laughton C. A. SoftwareX. 2016;5:44–50.
- Jolliffe I. T., Cadima J. Philos. Trans. R. Soc., A. 2016;374:20150202. doi: 10.1098/rsta.2015.0202.
- Perez-Hernandez G., Paul F., Giorgino T., De Fabritiis G., Noe F. J. Chem. Phys. 2013;139:015102. doi: 10.1063/1.4811489.
- Lever J., Krzywinski M., Altman N. Nat. Methods. 2017;14:641–642. doi: 10.1038/nmeth.4526.
- Molgedey L., Schuster H. G. Phys. Rev. Lett. 1994;72:3634–3637. doi: 10.1103/PhysRevLett.72.3634.
- Naritomi Y., Fuchigami S. J. Chem. Phys. 2011;134:065101. doi: 10.1063/1.3554380.
- Coifman R. R., Lafon S., Lee A. B., Maggioni M., Nadler B., Warner F., Zucker S. W. Proc. Natl. Acad. Sci. U. S. A. 2005;102:7426–7431. doi: 10.1073/pnas.0500334102.
- Coifman R. R., Lafon S., Lee A. B., Maggioni M., Nadler B., Warner F., Zucker S. W. Proc. Natl. Acad. Sci. U. S. A. 2005;102:7432–7437. doi: 10.1073/pnas.0500896102.
- Coifman R. R., Lafon S. Appl. Comput. Harmon. Anal. 2006;21:5–30.
- Pedregosa F., Varoquaux G., Gramfort A., Michel V., Thirion B., Grisel O., Blondel M., Prettenhofer P., Weiss R., Dubourg V., Vanderplas J., Passos A., Cournapeau D., Brucher M., Perrot M., Duchesnay É. J. Mach. Learn. Res. 2011;12:2825–2830.
- Buitinck L., Louppe G., Blondel M., Pedregosa F., Mueller A., Grisel O., Niculae V., Prettenhofer P., Gramfort A., Grobler J., Layton R., Vanderplas J., Joly A., Holt B. and Varoquaux G., 2013, arXiv: abs/1309.0238v1.
- Kromann J. C., Calculate Root-Mean-Square Deviation (RMSD) of Two Molecules Using Rotation, https://github.com/charnley/rmsd/commit/cd8af49, accessed May 2019.
- Kabsch W. Acta Crystallogr., Sect. A: Cryst. Phys., Diffr., Theor. Gen. Crystallogr. 1976;32:922–923.
- It is worth noting that the term “goodness of fit” might be misleading, as the proportion of variance described by the model is not the only factor that determines what is a well-fitting model. The term is being used here for the sake of consistency with the literature
- Dokmanić I., Parhizkar R., Ranieri J., Vetterli M. IEEE Signal Process. Mag. 2015;32:12–30.
- Glunt W., Hayden T. L., Liu W.-M. Bull. Math. Biol. 1991;53:769–796. doi: 10.1007/BF02461553.
- Cieplak T., Wisniewski J. L. Molecules. 2001;6:915–926.
- A recent paper by Tsutsumi et al. presented a dimensionality reduction method using Classical Multidimensional Scaling (CMDS). They employ CMDS on both IRCs and global reaction route maps to generate two-dimensional mappings of reaction pathways. The main difference between CMDS and PCA is the nature of the input to each method: PCA uses numerical variables (sometimes referred to as “features” of the data set) available for the input data elements (with each element being, for example, a set of Cartesian coordinates or the corresponding matrix of squared interatomic distances) whereas CMDS uses distances between data elements (e.g., distances between different sets of Cartesian coordinates). The goal of CMDS is to take a matrix of squared distances between data points and project the points in a lower dimensional space that preserves the distances between those points the best. A procedure like this does provide useful information about relative locations between data points, but does not inherently provide anything beyond a mapping. CMDS is also not typically used to transform the data in reduced dimensions back into the full-dimensional space
- Tsutsumi T., Ono Y., Arai Z., Taketsugu T. J. Chem. Theory Comput. 2018;14:4263–4270. doi: 10.1021/acs.jctc.8b00176.
- Carpenter B. K., Harvey J. N., Glowacki D. R. Phys. Chem. Chem. Phys. 2015;17:8372–8381. doi: 10.1039/c4cp05078a.
- Frisch M. J., Trucks G. W., Schlegel H. B., Scuseria G. E., Robb M. A., Cheeseman J. R., Scalmani G., Barone V., Mennucci B., Petersson G. A., Nakatsuji H., Caricato M., Li X., Hratchian H. P., Izmaylov A. F., Bloino J., Zheng G., Sonnenberg J. L., Hada M., Ehara M., Toyota K., Fukuda R., Hasegawa J., Ishida M., Nakajima T., Honda Y., Kitao O., Nakai H., Vreven T., Montgomery J. A., Peralta J. E., Ogliaro F., Bearpark M., Heyd J. J., Brothers E., Kudin K. N., Staroverov V. N., Kobayashi R., Normand J., Raghavachari K., Rendell A., Burant J. C., Iyengar S. S., Tomasi J., Cossi M., Rega N., Millam J. M., Klene M., Knox J. E., Cross J. B., Bakken V., Adamo C., Jaramillo J., Gomperts R., Stratmann R. E., Yazyev O., Austin A. J., Cammi R., Pomelli C., Ochterski J. W., Martin R. L., Morokuma K., Zakrzewski V. G., Voth G. A., Salvador P., Dannenberg J. J., Dapprich S., Daniels A. D., Farkas O., Foresman J. B., Ortiz J. V., Cioslowski J. and Fox D. J., Gaussian 09, Revision D.02, Wallingford, CT, 2009.
- Pople J. A., Raghavachari K., Schlegel H. B., Binkley J. S. Int. J. Quantum Chem. 1979;16:225–241.
- Ess D. H., Wheeler S. E., Iafe R. G., Xu L., Çelebi-Ölçüm N., Houk K. N. Angew. Chem., Int. Ed. 2008;47:7592–7601. doi: 10.1002/anie.200800918.
- Hare S. R., Tantillo D. J. Pure Appl. Chem. 2017;89:679–698.
- Birney D. M. Curr. Org. Chem. 2010;14:1658–1668.
- Bogle X. S., Singleton D. A. Org. Lett. 2012;14:2528–2531. doi: 10.1021/ol300817a.
- Collins P., Carpenter B. K., Ezra G. S., Wiggins S. J. Chem. Phys. 2013;139:154108. doi: 10.1063/1.4825155.
- Hare S. R., Tantillo D. J. Beilstein J. Org. Chem. 2016;12:377–390. doi: 10.3762/bjoc.12.41.
- Maeda S., Harabuchi Y., Ono Y., Taketsugu T., Morokuma K. Int. J. Quantum Chem. 2015;115:258–269.
- Sheppard A. N., Acevedo O. J. Am. Chem. Soc. 2009;131:2530–2540. doi: 10.1021/ja803879k.
- Siebert M. R., Manikandan P., Sun R., Tantillo D. J., Hase W. L. J. Chem. Theory Comput. 2012;8:1212–1222. doi: 10.1021/ct300037p.
- Ernst M., Sittel F., Stock G. J. Chem. Phys. 2015:143. doi: 10.1063/1.4938249.
- Oliphant T. E., Guide to NumPy, Trelgol Publishing, United States, 2006.
- Hunter J. D. Comput. Sci. Eng. 2007;9:90–95.
- Humphrey W., Dalke A., Schulten K. J. Mol. Graphics. 1996;14:33–38. doi: 10.1016/0263-7855(96)00018-5.