Abstract
Molecular dynamics (MD) simulations have been widely applied to study macromolecules, including proteins. However, the high dimensionality of the datasets produced by simulations makes thorough analysis difficult and further hinders a deeper understanding of biomacromolecules. To gain more insights into protein structure-function relations, appropriate dimensionality reduction methods are needed to project simulations onto low-dimensional spaces. Linear dimensionality reduction methods, such as principal component analysis (PCA) and time-structure based independent component analysis (t-ICA), fail to preserve sufficient structural information. Though better than linear methods, nonlinear methods, such as t-distributed stochastic neighbor embedding (t-SNE), still suffer from limitations in avoiding system noise and preserving inter-cluster relations. ivis is a novel deep learning-based dimensionality reduction method originally developed for single-cell datasets. Here we applied this framework to study the light, oxygen and voltage (LOV) domain of diatom Phaeodactylum tricornutum aureochrome 1a (PtAu1a). Compared with other methods, ivis is shown to be superior in constructing a Markov state model (MSM), preserving information on both local and global distances, and maintaining similarity between the high dimension and the low dimension with the least information loss. Moreover, the ivis framework is capable of providing a new perspective for deciphering residue-level protein allostery through the feature weights in the neural network. Overall, ivis is a promising member of the analysis toolbox for proteins.
Introduction
Molecular dynamics (MD) simulations have been widely used to provide insights into the functions of biomolecules through atomic-scale mechanisms.1 For this purpose, extensive timescales are generally preferred for simulations probing protein dynamics and functions. With the advent of graphics processing units (GPUs) and their application in biomolecular simulations, MD simulation timescales have extended from nanoseconds to experimentally meaningful microseconds.2,3 However, simulation data for biomacromolecules such as proteins are high-dimensional and suffer from the curse of dimensionality,4 which hinders in-depth analysis, including extracting slow-timescale protein motions,5 identifying representative protein structures6 and clustering kinetically similar macrostates.7 To make these analyses feasible, it is informative to construct a low-dimensional space that best characterizes protein dynamics.
In recent years, new dimensionality reduction algorithms have been developed that can be applied to analyze protein simulations, construct representative distributions in low-dimensional space, and extract intrinsic relations between protein structure and functional dynamics. These methods can be broadly categorized into linear and nonlinear methods.8,9 Linear dimensionality reduction methods produce new variables as linear combinations of the input variables; examples include principal component analysis (PCA)10 and time-structure based independent component analysis (t-ICA).11 Nonlinear methods construct variables through nonlinear functions, including t-distributed stochastic neighbor embedding (t-SNE)12 and autoencoders.13 It has been reported that nonlinear methods are more powerful in reducing dimensionality while preserving representative structures.14
Information is inevitably lost to a certain degree through the dimensionality reduction process.15 It is expected that the distances among data points in the low-dimensional space resemble those of the original data in the high-dimensional space. The Markov state model (MSM) is often applied to study the dynamics of biomolecular systems. An MSM is constructed by clustering states in the reduced-dimensional space to capture long-time kinetic information.16 However, many dimensionality reduction methods, such as PCA and t-ICA, fail to keep these similarity characteristics in the low dimension, which can lead to misleading clustering analysis based on the low-dimensional projections.17 Therefore, more appropriate dimensionality reduction methods are needed to build proper MSMs.
A novel framework, ivis,18 is a recently developed dimensionality reduction method for single-cell datasets. ivis is a nonlinear method based on siamese neural networks (SNNs).19 The SNN architecture consists of three identical neural networks and ranks the similarity among the input data. The loss function used in training is a triplet loss function20 that computes Euclidean distances among data points and simultaneously minimizes the distances between data of the same label while maximizing the distances between data of different labels. Due to this intrinsic property, the ivis framework is capable of preserving both local and global structures in the low-dimensional space.
With its success on single-cell expression data, the ivis framework is promising as a dimensionality reduction method for simulations of biomacromolecules to investigate their functional dynamics such as allostery. Diatom Phaeodactylum tricornutum aureochrome 1a (PtAu1a) is a light, oxygen, or voltage (LOV) protein; aureochromes were first discovered in the photosynthetic stramenopile alga Vaucheria frigida.21 This protein consists of an N-terminal domain, a C-terminal LOV core, and a basic region leucine zipper (bZIP) DNA-binding domain. PtAu1a is a monomer in the native dark state. The interaction between its LOV core and bZIP prohibits DNA binding.22 Upon light perturbation, a covalent bond forms between the C4a position of the cofactor flavin mononucleotide (FMN) and the sulfur of cysteine 287, triggering a conformational change that leads to LOV domain dimerization. In the current study, the PtAu1a LOV domain (AuLOV) with two flanking helices (A'α and Jα) was simulated through MD simulations. The structures of both the dark state and the light state are shown in Figure 1. The main difference between AuLOV and most other LOV proteins is that the LOV domain lies at the C-terminus in AuLOV but at the N-terminus in other LOV proteins.23,24 Therefore, the conformational changes in AuLOV are expected to differ from those of other LOV proteins, raising the question of how the allosteric signal transmits in AuLOV. In the current study, the ivis framework, together with other dimensionality reduction methods, is applied to project the AuLOV simulations onto reduced-dimensional spaces. The performance of the selected methods is assessed and compared, validating ivis as a superior framework for dimensionality reduction of biomacromolecular simulations.
Figure 1:
Native structures of AuLOV. (A) Two monomers in the native dark state; (B) A dimer in the native light state. The secondary-structure sequence spans Ser240 to Glu367.
Methods
Molecular Dynamics (MD) simulations
The crystal structures of the AuLOV dark and light states were obtained from the Protein Data Bank (PDB)25 with PDB IDs 5dkk and 5dkl, respectively. The light structure sequence starts from Gly234, while the dark structure sequence starts from Phe239 in chain A and Ser240 in chain B. For consistency, residues before Ser240 were removed to keep the same number of residues in all chains, so that simulations of the dark state and light state could be treated identically. Both structures contain FMN as cofactor. The FMN force field from a previous study26 was used here. Two new states, named the transient dark state (forcing the Cysteinyl-Flavin C4a adduct in the dark state structure) and the transient light state (breaking the Cysteinyl-Flavin C4a adduct in the light state structure), were constructed to fully explore the protein conformational space. Two monomers (Figure 1A) and a dimer (Figure 1B) were simulated for the dark states and light states, respectively.
The crystal structures with added hydrogen atoms were solvated within a rectangular water box using the TIP3P water model.27 Sodium and chloride ions were added for charge neutralization. Energy minimization was performed for each water box. Each system was then subjected to 20 picoseconds (ps) of MD simulation to raise the temperature from 0 K to 300 K and another 20 ps for equilibration. 10 nanoseconds (ns) of isothermal-isobaric ensemble (NPT) MD simulation under 1 bar pressure were conducted. The canonical ensemble (NVT) is usually applied in production runs to investigate allosteric processes.28,29 1.1 microseconds (μs) of canonical ensemble (NVT) Langevin MD simulation at 300 K was carried out for each production run. The Langevin dynamics friction coefficient, which couples the system to the heat bath, was set to 1 ps−1,30,31 with minimum perturbation to the dynamical properties of the protein system.32 For all production simulations, the first 100 ns were treated as the equilibration stage and excluded from analysis. For each structure, three independent MD simulations were carried out, and a total of 12 μs of simulations was used in the analysis. All chemical bonds involving hydrogen atoms were constrained with the SHAKE method. A 2 femtosecond (fs) step size was used, and simulation trajectories were saved every 100 ps. Periodic boundary conditions (PBC) were applied. Electrostatic interactions were calculated with the particle mesh Ewald (PME) algorithm33 and a cutoff of 1.2 nanometers (nm). Simulations were conducted using GPU-accelerated calculations of OpenMM34 with the CHARMM35 simulation package version c41b1 and the CHARMM27 force field.36
Feature Processing
In MD simulations, protein structures are represented as atom positions in Cartesian coordinates. However, this representation is neither rotation-invariant nor feasible for analysis purposes due to the significant number of atoms with a total of 3N degrees of freedom. To represent the protein structures with rotational invariance while retaining essential structural information, pairwise backbone Cα distances were selected to represent the overall protein configuration. Following our previously proposed feature processing method,37 distances were encoded through a rectified linear unit (ReLU)38-like activation function and further expanded as a vector,
$$\tilde{d}_{ij} = \max\left(0,\; d_c - d_{ij}\right) \tag{1}$$

where $d_{ij}$ is the distance between Cα atoms $i$ and $j$ and $d_c$ is a cutoff distance.
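As a concrete illustration, the following minimal sketch (hypothetical file names; the 10.0 Å cutoff is the value used in the Results section) computes pairwise Cα distances with MDTraj and applies the ReLU-like encoding of Eq. (1):

```python
import itertools

import mdtraj as md
import numpy as np

# Load a trajectory (file names are placeholders).
traj = md.load("production.dcd", top="aulov.pdb")

# Indices of backbone C-alpha atoms and all unique pairs.
ca = traj.topology.select("name CA")
pairs = np.array(list(itertools.combinations(ca, 2)))

# Pairwise distances in Angstroms (MDTraj reports nanometers).
d = md.compute_distances(traj, pairs) * 10.0

# ReLU-like encoding of Eq. (1): pairs beyond the cutoff carry no signal.
d_cut = 10.0  # Angstroms
features = np.maximum(0.0, d_cut - d)  # shape: (n_frames, n_pairs)
```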
Dimensionality Reduction Methods
ivis
ivis is a deep learning-based method for structure-preserving dimensionality reduction. The framework is built on siamese neural networks, a novel architecture that ranks similarity among input data. Three identical networks are included in the SNN. Each network consists of three dense layers and an embedding layer. The size of the embedding layer was set to 2, aiming to project high-dimensional data into a 2D space. The scaled exponential linear unit (SELU)39 activation function is used in the dense layers,
$$\mathrm{selu}(x) = \lambda \begin{cases} x & \text{if } x > 0 \\ \alpha e^{x} - \alpha & \text{if } x \le 0 \end{cases} \tag{2}$$

where $\lambda$ and $\alpha$ are fixed parameters ($\lambda \approx 1.0507$, $\alpha \approx 1.6733$).39
The LeCun normal distribution is applied to initialize the weights of these layers. For the embedding layer, a linear activation function is used, and the weights are initialized with Glorot's uniform distribution. To avoid overfitting, a dropout layer with a default dropout rate of 0.1 follows each dense layer.
A triplet loss function is used as the loss function for training,
$$L_{\mathrm{triplet}} = \max\left(D_{a,p} - \min\left(D_{a,n},\, D_{p,n}\right) + m,\; 0\right) \tag{3}$$
where $a$, $p$, and $n$ are anchor points, positive points, and negative points, respectively, and $D$ and $m$ are the Euclidean distance and the margin, respectively. Anchor points are the points of interest. The triplet loss function aims to minimize the distance between anchor and positive points while maximizing the distance between anchor and negative points. The distance between positive and negative points is also taken into account, as shown in the $\min(D_{a,n}, D_{p,n})$ term of the above equation.
The k-nearest neighbors (KNNs) are used to obtain data for the triplet loss function. k is a tuning parameter and is set to 100. For each round of calculation, one point in the dataset is selected as an anchor. A positive point is randomly selected among the nearest k neighbors around the anchor, and a negative point is randomly selected outside the neighbors. For each training epoch, the triplet selection is updated to maximize the differences in both local and global distances.
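A simplified numpy/scikit-learn sketch of this triplet scheme (a per-anchor illustration under these assumptions, not the batched implementation inside ivis):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def triplet_loss(a, p, n, margin=1.0):
    """Eq. (3): push the anchor-positive distance below the smaller of
    the anchor-negative and positive-negative distances, up to a margin."""
    d_ap = np.linalg.norm(a - p)
    d_an = np.linalg.norm(a - n)
    d_pn = np.linalg.norm(p - n)
    return max(d_ap - min(d_an, d_pn) + margin, 0.0)

def sample_triplet(X, k=100, rng=None):
    """Pick an anchor, a positive among its k nearest neighbors,
    and a negative outside that neighborhood."""
    if rng is None:
        rng = np.random.default_rng(0)
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)
    i = rng.integers(len(X))
    _, idx = nbrs.kneighbors(X[i:i + 1])
    neighbors = idx[0][1:]                              # drop the anchor itself
    pos = rng.choice(neighbors)
    neg = rng.choice(np.setdiff1d(np.arange(len(X)), idx[0]))
    return X[i], X[pos], X[neg]
```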
If the dataset can be classified into different groups based on intrinsic properties, ivis can also be used as a supervised learning method by combining the distance-based triplet loss function with a classification loss. The supervision weight is a tuning parameter that controls the relative importance of the classification loss.
The neural network is trained using the Adam optimizer with a learning rate of 0.001. Early stopping, a method to prevent overfitting when training neural networks, is applied in this study to terminate the training process if the loss function does not decrease over 10 consecutive epochs.
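Assuming the ivis Python package cited above, a training run with the hyperparameters described in this section might look like the following sketch (synthetic data stand in for the encoded distance features X and state labels y):

```python
import numpy as np
from ivis import Ivis

# Synthetic stand-ins for encoded distance features and state labels.
X = np.random.rand(1000, 500).astype("float32")
y = np.random.randint(0, 4, size=1000)

model = Ivis(
    embedding_dims=2,              # project onto a 2D space
    k=100,                         # KNN size for triplet selection
    model="maaten",                # 500-500-2000 dense architecture
    supervision_weight=0.1,        # weight of the classification loss
    n_epochs_without_progress=10,  # early stopping patience
)
embedding = model.fit_transform(X, y)  # omit y for unsupervised ivis
```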
Time-Structure Independent Component Analysis (t-ICA)
The t-ICA method identifies the slowest motions in molecular simulations and is commonly used as a dimensionality reduction method for macromolecular simulations.11 For given n-dimensional data, t-ICA solves the following generalized eigenvalue problem:
$$\mathbf{C}(\tau)\,\mathbf{F} = \mathbf{C}(0)\,\mathbf{F}\,\mathbf{K} \tag{4}$$
where $\mathbf{K}$ is the eigenvalue matrix and $\mathbf{F}$ is the eigenvector matrix. $\mathbf{C}(\tau)$ is the time-lagged correlation matrix defined as
$$C_{ij}(\tau) = \left\langle x_i(t)\, x_j(t+\tau) \right\rangle_t \tag{5}$$

where $x_i(t)$ denotes the mean-free value of feature $i$ at time $t$.
The results calculated by t-ICA are linear combinations of input features that are highly autocorrelated.
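A bare-bones numpy/scipy sketch of Eqs. (4)-(5) under the usual assumptions (mean-free data, symmetrized lagged covariance); production analyses would use a dedicated implementation such as the one shipped with MSMBuilder:

```python
import numpy as np
from scipy.linalg import eigh

def tica(X, lag=10):
    """X: (n_frames, n_features) time series.
    Solves C(lag) F = C(0) F K and returns slowest components first."""
    X = X - X.mean(axis=0)                           # mean-free data
    C0 = X.T @ X / (len(X) - 1)                      # instantaneous covariance
    Ct = X[:-lag].T @ X[lag:] / (len(X) - lag - 1)   # time-lagged covariance
    Ct = 0.5 * (Ct + Ct.T)                           # symmetrize
    K, F = eigh(Ct, C0)                              # generalized eigenproblem
    order = np.argsort(K)[::-1]
    return K[order], F[:, order]
```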
Principal Component Analysis (PCA)
PCA finds the projection vectors that maximize the variance by conducting an orthogonal linear transformation.10 In the new coordinate system, the greatest variance of the data lies on the first coordinate, called the first principal component. Principal components can be computed through singular value decomposition (SVD).40 Given a data matrix X with columns centered to zero mean, the covariance matrix can be calculated as:
$$\mathbf{C} = \frac{1}{n-1}\,\mathbf{X}^{\mathsf{T}}\mathbf{X} \tag{6}$$
where n is the number of samples. C is a symmetric matrix and can be diagonalized as:
$$\mathbf{C} = \mathbf{V}\,\mathbf{L}\,\mathbf{V}^{\mathsf{T}} \tag{7}$$
where V is a matrix of eigenvectors and L is a diagonal matrix with eigenvalues λi in descending order.
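A minimal numpy sketch of PCA through SVD of the centered data matrix, consistent with Eqs. (6)-(7):

```python
import numpy as np

def pca(X, n_components=2):
    """PCA via SVD: rows of Vt are eigenvectors of the covariance
    matrix, and S**2/(n-1) are the corresponding eigenvalues."""
    Xc = X - X.mean(axis=0)                           # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]
    eigenvalues = S[:n_components] ** 2 / (len(X) - 1)
    return Xc @ components.T, eigenvalues             # projections, variances
```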
t-Distributed Stochastic Neighbor Embedding (t-SNE)
t-SNE is a nonlinear dimensionality reduction method that embeds similar objects in high dimensions as nearby points in a low-dimensional space.12 t-SNE has been demonstrated to be a suitable dimensionality reduction method for protein simulations.41 The calculation consists of two stages. First, a conditional probability is calculated to represent the similarity between two objects:
$$p_{j|i} = \frac{\exp\!\left(-\lVert x_i - x_j\rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\!\left(-\lVert x_i - x_k\rVert^2 / 2\sigma_i^2\right)} \tag{8}$$
where $\sigma_i$ is the bandwidth of the Gaussian kernel centered on $x_i$.
Because the conditional probability is not symmetric ($p_{j|i} \ne p_{i|j}$), a symmetrized joint probability is defined as:
$$p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n} \tag{9}$$
In order to better represent the similarity among objects in the reduced map, the similarity qij is defined as:
$$q_{ij} = \frac{\left(1 + \lVert y_i - y_j\rVert^2\right)^{-1}}{\sum_{k \neq l}\left(1 + \lVert y_k - y_l\rVert^2\right)^{-1}} \tag{10}$$
Combining the joint probability $p_{ij}$ and similarity $q_{ij}$, the Kullback–Leibler (KL) divergence is used to determine the coordinates $y_i$:
$$C = \mathrm{KL}(P \,\|\, Q) = \sum_i \sum_{j \neq i} p_{ij} \log \frac{p_{ij}}{q_{ij}} \tag{11}$$
The KL divergence measures the difference between the high-dimensional data and the low-dimensional points and is minimized through gradient descent.
A drawback of the traditional t-SNE method is its long runtime. To speed up the dimensionality reduction process, Multicore t-SNE42 is used and abbreviated as t-SNE in this study.
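Assuming the Multicore-TSNE package cited above, its drop-in usage is a one-liner (synthetic data here):

```python
import numpy as np
from MulticoreTSNE import MulticoreTSNE as TSNE

X = np.random.rand(5000, 100)  # placeholder for encoded distance features
proj = TSNE(n_components=2, n_jobs=8).fit_transform(X)  # (5000, 2) embedding
```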
Performance Assessment Criteria
Several assessment criteria are applied to quantify and compare the performance of each dimensionality reduction method.
Root-Mean-Square Deviation (RMSD)
The RMSD is used to measure the conformational change in each frame with regard to a reference structure. Given a molecular structure, the RMSD is calculated as:
$$\mathrm{RMSD} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left\lVert r_i - r_i^{\mathrm{ref}} \right\rVert^2} \tag{12}$$
where $r_i$ is the position vector of the $i$th atom in Cartesian coordinates, $r_i^{\mathrm{ref}}$ is the position of the $i$th atom in the reference structure, and $N$ is the number of atoms.
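A numpy sketch of Eq. (12), assuming the two structures have already been superposed (RMSD is normally computed after optimal alignment):

```python
import numpy as np

def rmsd(coords, ref):
    """Eq. (12) for two (n_atoms, 3) coordinate arrays in Angstroms.
    Assumes coords is already superposed onto ref."""
    diff = coords - ref
    return np.sqrt((diff ** 2).sum() / len(coords))
```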
Pearson Correlation Coefficient (PCC)
The Pearson correlation coefficient43 reflects the linear correlation between two variables. PCC has been rigorously applied to estimate the linear relation between distances in the original space and the reduced space.44 The L2 distance, also called the Euclidean distance, is used for the distance calculation:
$$D(x, y) = \lVert x - y \rVert_2 = \sqrt{\sum_{i=1}^{d} (x_i - y_i)^2} \tag{13}$$
Based on the L2 distance expression, PCC is calculated as:
$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \tag{14}$$
where $n$ is the sample size, $x_i$ and $y_i$ are the distances, and $\bar{x}$ and $\bar{y}$ are the mean values of the distances.
Spearman’s Rank-Order Correlation Coefficient
Spearman's rank-order correlation coefficient is used to quantify how well distances between all pairs of points in the original space are preserved in the reduced dimensions. Specifically, the Spearman correlation coefficient measures the difference in distance rankings, calculated as:
$$\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} \tag{15}$$
where di is the difference in paired ranks and n equals the total number of samples.
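Both correlation measures are available in scipy; a short sketch applied to paired distance lists from the original and the reduced space (synthetic data here):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(0)
d_high = rng.random(10_000)                  # pair distances, original space
d_low = d_high + 0.1 * rng.random(10_000)    # pair distances, reduced space

pcc, _ = pearsonr(d_high, d_low)             # Eq. (14)
rho, _ = spearmanr(d_high, d_low)            # Eq. (15)
print(f"PCC = {pcc:.3f}, Spearman rho = {rho:.3f}")
```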
Mantel Test
The Mantel test is a non-parametric method, originally applied in genetics,45 that tests the correlation between two distance matrices. A common problem in evaluating the correlation coefficient is that distances are dependent on each other, so significance cannot be assessed directly. The Mantel test overcomes this obstacle through permutations of the rows and columns of one of the matrices, with the correlation between the two matrices recalculated at each permutation. The MantelTest GitHub repository46 was used to implement the algorithm.
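A minimal sketch of the permutation scheme (the cited MantelTest repository was used in the actual study; this illustration permutes rows and columns together and compares upper-triangular entries):

```python
import numpy as np
from scipy.stats import pearsonr

def mantel(D1, D2, n_perm=10_000, rng=None):
    """Observed correlation between two square distance matrices plus
    the null distribution from joint row/column permutations of D2."""
    if rng is None:
        rng = np.random.default_rng(0)
    iu = np.triu_indices_from(D1, k=1)        # unique off-diagonal entries
    r_obs, _ = pearsonr(D1[iu], D2[iu])
    r_perm = np.empty(n_perm)
    for t in range(n_perm):
        p = rng.permutation(len(D2))
        Dp = D2[np.ix_(p, p)]                 # permute rows and columns together
        r_perm[t], _ = pearsonr(D1[iu], Dp[iu])
    return r_obs, r_perm
```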
Shannon Information Content (IC)
While chemical information in the original space is lost to a certain degree in the reduced space, dimensionality reduction methods are expected to retain as much information as possible. The Shannon information content is applied to test the information preservation in the reduced space, defined as:
$$I(x) = -\log_2 P(x) \tag{16}$$
where P is the probability of a specific event x. To avoid possible dependencies among different features in the reduced dimensions, the original space was reduced to one dimension (1D) to calculate the IC. The values in the 1D space were sorted and placed into 100 bins of equal width. The bins were treated as events, and the corresponding probabilities were calculated as the ratio of the number of samples in each bin to the total number of samples.
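A numpy sketch of this binning procedure, under the reading that the reported IC is Eq. (16) summed over the occupied bins (an assumption consistent with the magnitudes reported in Figure 6C):

```python
import numpy as np

def shannon_ic(z, n_bins=100):
    """Bin a 1D projection into equal-width bins, treat the bins as
    events, and sum -log2 P over the occupied bins."""
    counts, _ = np.histogram(z, bins=n_bins)
    p = counts[counts > 0] / counts.sum()     # skip empty bins
    return float(-np.sum(np.log2(p)))
```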
Markov State Model (MSM)
The Markov state model has been widely used to partition protein conformational space into kinetically separated macrostates47 and to estimate relaxation times characterizing long-timescale dynamics.6 MSMBuilder48 (version 3.8.0) was employed to implement the Markov state model. The k-means clustering method was used to cluster the data into 1,000 microstates. A series of lag times at equal intervals was used to calculate the transition matrix. The corresponding second eigenvalue was used to estimate the relaxation timescale, calculated as:
$$t = -\frac{\tau}{\ln \lambda_1} \tag{17}$$
where λ1 is the second eigenvalue and τ is the lag time.
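A numpy sketch of Eq. (17) for a row-stochastic transition matrix estimated at a given lag time (the largest eigenvalue of such a matrix is 1, so the second-largest governs the slowest relaxation):

```python
import numpy as np

def implied_timescale(T, lag_ns):
    """Slowest implied timescale (same units as lag_ns) from the
    second-largest eigenvalue of transition matrix T."""
    eigvals = np.sort(np.linalg.eigvals(T).real)[::-1]
    lam = eigvals[1]                 # second eigenvalue; eigvals[0] == 1
    return -lag_ns / np.log(lam)
```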
The generalized matrix Rayleigh quotient (GMRQ),49 computed using a combination of cross-validation and the variational approach, was used to assess the effectiveness of MSMs across different numbers of dimensions and dimensionality reduction methods. State decompositions differ among dimensionality reduction methods; a good method is expected to lead to a Markov state model with a larger GMRQ value.
Machine Learning Methods
Random Forest (RF)
Random forest50 is a supervised machine learning method that was used in this study for trajectory state classification. A random forest model consists of multiple decision trees, a class of partitioning algorithms that recursively group data samples of the same label. Features at each split are selected based on information gain. The final random forest prediction is obtained by voting over the individual decision trees. For random forest models at each depth, the number of decision trees was set to 50. Scikit-learn (version 0.20.1)51 was used for the RF implementation.
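A short scikit-learn sketch of this setup (synthetic features and labels; the depth shown is one of the values scanned in the Results):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 50)               # placeholder features
y = np.random.randint(0, 9, size=1000)     # placeholder macrostate labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
rf = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=0)
rf.fit(X_tr, y_tr)
print(rf.score(X_te, y_te))                          # classification accuracy
print(rf.feature_importances_.argsort()[::-1][:10])  # top-10 features
```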
Artificial Neural Network (ANN)
An artificial neural network was used to learn the nonlinear relationships of coordinates in the reduced 2D space. An ANN is generally formed from an input layer, hidden layers, and an output layer. In each layer, neurons (nodes) are assigned and connected to adjacent layer(s). During training, input data are fed through the input and hidden layers, and predictions are made in the output layer. For each training step, the error between the predicted and actual results is propagated from the output layer back to the input layer, a procedure called backpropagation,52 and the weight of every neuron is updated. When there is more than one hidden layer, an ANN is also referred to as a deep neural network (DNN), which requires more computational power. To minimize training cost, only two hidden layers, each with 64 nodes, were used. The Adam optimizer53 was used for weight optimization. The ANN was implemented with Keras (version 2.2.4-tf).54
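A Keras sketch of this two-hidden-layer classifier (synthetic data; the number of output classes follows the macrostate counts used later in the Results):

```python
import numpy as np
from tensorflow import keras

X = np.random.rand(1000, 2).astype("float32")  # reduced 2D coordinates
y = np.random.randint(0, 9, size=1000)         # macrostate labels

model = keras.Sequential([
    keras.layers.Dense(64, activation="relu", input_shape=(2,)),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(9, activation="softmax"),   # one unit per macrostate
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X, y, epochs=10, batch_size=64, validation_split=0.3)
```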
Results
Dataset of Cα distances represents the protein structures
There are two native states of AuLOV: the native dark state and the native light state. To explore the protein response to the formation of the covalent bond between cysteine 287 and FMN, two new states, the transient dark state and the transient light state, were constructed by forcing the Cysteinyl-Flavin C4a adduct in the native dark state and breaking this adduct in the native light state, respectively. The RMSDs of the MD simulations are plotted in Figure 2. For each trajectory, RMSD values were calculated with respect to the first frame. Averaged RMSDs were 1.75 Å, 2.04 Å, 2.39 Å and 2.08 Å for the native dark, transient dark, transient light and native light states, respectively. Compared with the native dark state, the higher RMSD of the transient dark state indicates that the light-induced Cys287-FMN covalent bond increases protein flexibility and dynamics. The transient light state displays the highest averaged RMSD, indicating the highest flexibility or largest conformational change.
Figure 2:
The RMSDs of AuLOV MD trajectories. (A) Native dark and transient dark states; (B) Native light and transient light states. For each state, three independent simulations were carried out.
The pairwise backbone Cα distances in the simulations were extracted as features characterizing the protein configurations. There are 254 residues in the AuLOV structure, and a total of 254 × 253/2 = 32,131 Cα distances were calculated. Before further analysis, features were transformed into vectors with the technique outlined in the Methods section. Considering non-bonded chemical interactions, 10.0 Å was selected as the threshold for the feature transformation. There are 10,000 frames in each trajectory, leading to a sample size of 120,000 in the overall dataset. The full dataset was used in all analyses. For statistical significance, each MD trajectory was split into 5 sub-trajectories at equal intervals, and the performance assessments were conducted for each sub-trajectory independently. The mean and standard deviation over the 5 subsets were calculated.
Information is well-preserved in ivis dimensionality reduction method
Several hyperparameters of the ivis model were selected based on the values recommended for different observation sizes. Given the large sample size, k was set to 100 and the early stopping patience was 10 epochs. The Maaten neural network architecture was selected, which consists of three dense layers with 500, 500 and 2,000 neurons, respectively. To select the best supervision weight, the full trajectory dataset was randomly split into a training set (70%) and a testing set (30%). ivis models were trained on the training set and validated on the testing set. The prediction results with different supervision weights are plotted in Figure 3. The ivis model performed poorly at 0.0 supervision weight, which corresponds to unsupervised ivis, with an average accuracy below 25%. The average accuracy values for the other supervision weights were stable and over 95%. Specifically, there was no significant increase in accuracy beyond a supervision weight of 0.1, which was therefore chosen as the hyperparameter for the supervised ivis model. An unsupervised ivis framework with the same values for all other hyperparameters was applied for comparison.
Figure 3:
Classification accuracy using the ivis framework with different supervision weights. With a supervision weight of 0.0, the model is referred to as the unsupervised ivis model. Classification accuracy is high for any non-zero supervision weight; therefore, 0.1 was chosen as the hyperparameter for supervised ivis.
Four dimensionality reduction models (supervised ivis, PCA, t-SNE and t-ICA) were applied to the MD simulations to project the high-dimensional (32,131) space onto a 2D surface (Figure 4). The supervised ivis method, as well as PCA, successfully separated dark and light states while keeping the corresponding transient states close (Figure 4A and 4B). These states are important for dynamical analysis, as they could be used to reveal the free energy and kinetic transition landscape of the target system. In the t-SNE (Figure 4C) and t-ICA (Figure 4D) projections, the transient dark state and native dark state overlap significantly, hindering the extraction of thermodynamic and kinetic information. Therefore, among the investigated methods, supervised ivis and PCA are shown to properly represent the chemical information in the low dimension.
Figure 4:
2D projections of four dimensionality reduction methods. (A) Supervised ivis; (B) PCA; (C) t-SNE; (D) t-ICA.
k-means clustering was used in the reduced dimensions to partition a total of 120,000 frames from the AuLOV MD trajectories into 1,000 microstates. Within each cluster, RMSDs were calculated for each structure pair. The RMSD value of a cluster is defined as the average RMSD among all structure pairs within that cluster. The results for the five dimensionality reduction models are shown in Figure 5A. The average RMSD value of an appropriate microstate should be lower than 1.0 Å.55,56 From this perspective, unsupervised ivis and supervised ivis show similar values in each microstate and are the best two methods among those tested. As reported previously,41 t-SNE also exhibited good performance in measuring similarity in Cartesian coordinates.
Figure 5:
Analysis results of 2D projections for different dimensionality reduction methods. (A) The average values of RMSDs in microstates clustered within projected 2D dimensional space. (B) Estimated implied timescales from Markov state models with regard to different lag times. For each model, the mean value of implied timescale is calculated among five subsets and is plotted in solid color. Standard deviation is calculated to show the stability for each lag time and is illustrated using light color.
Another metric for comparing dimensionality reduction methods is the implied relaxation timescale calculated from the Markov state model. To build the MSM, MD simulations were projected onto the 2D space and 1,000 microstates were sampled through k-means, with corresponding estimated relaxation timescales. For each method, the slowest implied timescale was extracted at lag times ranging from 10 to 100 ns (Figure 5B). The convergence of timescales is important for the eigenvalue and eigenvector calculations.57 For all five models, relaxation timescales converged, indicating the Markovianity of the MSMs. Both supervised and unsupervised ivis models show long timescales, indicating the effectiveness of the MSMs built on their reduced spaces.
Euclidean distances between data points in the low-dimensional space are expected to reflect the similarity in the high dimension. To quantify the degree to which this relationship is kept in the reduced space, Spearman correlation coefficients were calculated between Euclidean distance pairs in the original space and those in the reduced space. The results are shown in Figure 6A. While PCA preserved the Euclidean distances well with an average coefficient of 0.86, the supervised ivis model showed a comparably high coefficient of 0.84. The unsupervised ivis model also exhibited the ability to preserve this relationship. The poor performance of the t-SNE model may be because t-SNE is a nonlinear method, so distances in the high-dimensional space are not linearly projected onto the low-dimensional space, as reported in other studies.58,59
Figure 6:
Results of quantitative analysis. (A) Spearman correlation coefficient results of different dimensionality reduction methods. The mean values for unsupervised ivis, supervised ivis, PCA, t-SNE and t-ICA are 0.69, 0.84, 0.86, 0.41 and 0.70, respectively. The height of each box represents the interquartile range. (B) The Mantel test scores of different dimensionality reduction methods. The mean values for unsupervised ivis, supervised ivis, PCA, t-SNE and t-ICA are 0.83, 0.95, 0.98, 0.15 and 0.89, respectively. (C) Shannon information content of different dimensionality reduction methods. The mean values for unsupervised ivis, supervised ivis, PCA, t-SNE and t-ICA are 449.0, 714.6, 327.0, 707.3 and 54.0, respectively. To avoid dependent variables in information content calculation, high-dimensional Cα distances were projected to 1D. (D) Generalized matrix Rayleigh quotient with different dimensions and dimensionality reduction methods.
While the ivis models showed good ability in keeping the projection relation, the Spearman correlation coefficient fails to overcome the problem that the features are not independent. The pairwise distances are coupled through the motions of the Cα atoms: changing the coordinates of one Cα atom affects all distances involving that atom. Therefore, to address this issue, the Mantel test was used to randomize the Euclidean distances. Permutations of rows and columns in the Euclidean distance matrix were performed 10,000 times, with the Pearson correlation coefficient calculated at each permutation. The results of the Mantel test are plotted in Figure 6B. Both unsupervised and supervised ivis showed remarkable results in preserving the correspondence relationship under randomized order, with mean coefficients of 0.83 and 0.95, respectively.
During dimensionality reduction, information is inevitably lost to some degree. To measure the information retained through the dimensionality reduction process, the Shannon information was applied to the coordinates in the low-dimensional space. However, when dealing with multiple variables, especially the dependent Cα distances, the total Shannon information is not equal to the sum of the Shannon information of each variable. To reduce the computational complexity, the high-dimensional features were reduced to 1D for this calculation, and the results are plotted in Figure 6C. The supervised ivis model is superior in preserving information content with the least information loss. It is also worth noting that t-SNE showed better performance than the unsupervised ivis model.
To study the performance of the Markov state model across numbers of dimensions and dimensionality reduction methods, the generalized matrix Rayleigh quotient was calculated for each dimension and method (Figure 6D). The methods showed different trends: supervised ivis and t-ICA were the least and the most affected by the number of dimensions, respectively. For PCA and t-SNE, the optimal number of dimensions is in the tens. 2D is typically used for MSM construction and visualization purposes.60–62 In this regard, supervised ivis exhibited the highest GMRQ value.
ivis helps in understanding the biological system and allosteric mechanism
The transition probabilities among macrostates in the ivis projections are shown in Figure 7. Based on their similarity to the crystal structures, macrostates 2 and 8 are referred to as the native dark state and native light state, respectively. The other macrostates are considered transient states. The self-transition probabilities of the native dark and native light states are among the highest, indicating the high stability of these two states. Interestingly, macrostate 9 may have the highest stability among all Markov states.
Figure 7:
Transition probabilities between macrostates in the ivis dimensionality reduction 2D projections.
The effectiveness of an MSM depends on the projected 2D space, where appropriate discrete states are produced by clustering the original data points in the projection space. The number of macrostates was determined based on the implied timescales at different lag times in the different reduced spaces. In this study, 9, 9, 7, 9 and 7 macrostates were selected for unsupervised ivis, supervised ivis, PCA, t-SNE and t-ICA, respectively. The samples were clustered through Perron-cluster cluster analysis (PCCA). The dataset was further split into a training set (70%) and a testing set (30%). Two machine learning methods (random forest and artificial neural network) were applied to predict the macrostate of each data point based on the pairwise Cα distances. Prediction accuracy results are plotted in Figures 8A and 8B. The supervised ivis framework is the best among the five dimensionality reduction methods. Surprisingly, although the unsupervised ivis model was trained without class labels in the loss function, its high prediction accuracy demonstrates the good performance of its 2D projections. Random forest is often applied to distinguish macrostates, since it provides feature importance, which aids the interpretation of biological systems. The accumulated feature importance of the random forest model on the supervised ivis projections is plotted in Figure 8C. The top 490 features account for 90.2% of the overall feature importance.
Figure 8:
Prediction accuracy of different machine learning models. (A) Random forest and (B) artificial neural network were used on the reduced 2D spaces to predict labels of macrostates from MSM. (C) Accumulated feature importance of random forest model applied in the projections of supervised ivis framework at depth 5.
The high prediction accuracy of the supervised ivis framework suggests that supervised ivis is more promising in elucidating the conformational differences among macrostates. The weight matrix of the first dense layer of the supervised ivis model has dimensions 32,131 × 500, where 32,131 is the number of Cα distance features and 500 is the number of neurons in that layer. To identify key residues and structures that are important in the dimensionality reduction process, the weights associated with the 32,131 input features in this layer were treated as feature importances and are shown as a protein contact map in Figure 9. The contact map is symmetric along the diagonal. The upper triangular part is divided into four regions: region 1, pairwise Cα distances within chain A; region 4, Cα distances within chain B; regions 2 and 3, Cα distances between chains A and B. Our results show characteristics similar to a previous study.17 Local protein structures are encoded by features close to the diagonal, and global structures by features farther from the diagonal. In Figure 9, local information is highlighted by the red rectangle (the Cα and Dα helices of the AuLOV system), and global information by the black rectangle (the Gβ and Hβ strands). While region 2 (protein interactions from chain A to chain B) and region 3 (protein interactions from chain B to chain A) are mostly symmetric, we found asymmetric behavior (red circle in Figure 9): the interaction between Jα in chain A and the linkers in chain B is stronger than the interaction between Jα in chain B and the linkers in chain A.
Figure 9:
Protein contact map with corresponding protein structures. Feature weights of the first dense layer in the supervised ivis dimensionality reduction method were extracted and were colored as red (positive), white (close to zero) and blue (negative). The residue sequence starts from Ser240 in chain A and ends in Glu367 in chain B.
To examine the important residues identified in the protein contact map, the feature weight of each Cα distance was accumulated onto the two residues involved, thereby quantifying the significance of residues and structures. The top 20 residues are listed in Table 1, with experimentally identified important residues22,63–66 shown in bold font. The accumulated importance of each secondary structure is shown in Table 2, which indicates that the A'α helix, Jα helix and protein linkers are important in AuLOV allostery.
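A sketch of this accumulation step (a hypothetical reconstruction under the pair ordering of the feature pipeline, not the study's exact script; the placeholder weight matrix stands in for the trained first-layer weights):

```python
import itertools

import numpy as np

n_res = 254
pairs = list(itertools.combinations(range(n_res), 2))  # 32,131 Ca pairs

# W: (n_features, n_units) weights of the first dense layer of the
# trained ivis model; random values used here as a placeholder.
W = np.random.randn(len(pairs), 500)

feat_imp = np.abs(W).sum(axis=1)          # one importance value per feature
res_imp = np.zeros(n_res)
for (i, j), w in zip(pairs, feat_imp):
    res_imp[i] += w                       # credit both residues of the pair
    res_imp[j] += w
res_imp = 100 * res_imp / res_imp.sum()   # percent importance per residue
```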
Table 1:
Top 20 residues identified in the supervised ivis framework.
| Residue | Importance (%) | Residue | Importance (%) |
|---|---|---|---|
| ILE 242 | 1.12 | PHE 241 | 1.10 |
| LEU 245 | 1.07 | ALA 248a | 1.04 |
| GLN 250 | 1.01 | SER 314 | 1.00 |
| GLN 246 | 0.99 | ASN 251 | 0.98 |
| THR 247 | 0.97 | PRO 268 | 0.97 |
| ASN 329 | 0.96 | GLN 350 | 0.96 |
| MET 313 | 0.95 | ALA 244 | 0.94 |
| PHE 331 | 0.93 | ASN 311 | 0.93 |
| SER 240 | 0.92 | GLN 365 | 0.92 |
| CYS 351 | 0.91 | ALA 335 | 0.90 |
Experimentally confirmed important residues are shown in bold font.
Table 2:
Accumulated feature importance of secondary structures.
| Secondary structure | Importance (%) |
|---|---|
| A'α | 13.17 |
| Aβ | 6.34 |
| Bβ | 2.36 |
| Cα | 8.50 |
| Dα | 4.14 |
| Eα | 1.40 |
| Fα | 5.50 |
| Gβ | 6.52 |
| Hβ | 8.67 |
| Iβ | 7.98 |
| Jα | 10.44 |
| Linkers | 24.98 |
ivis is more computationally efficient than t-ICA and t-SNE
A key factor in comparing dimensionality reduction methods is their computational cost, which can be prohibitively expensive for large, high-dimensional datasets. To compare the computational efficiency of the methods with regard to sample size and feature size, three randomly generated datasets with values uniformly distributed between 0 and 1 were used for each dataset size. The relation between runtime and sample size, with a feature size of 1,000, is shown in Figure 10A. While t-SNE is stable and fast on small datasets (≤ 10,000 samples), its runtime grows the fastest among the five models, making it infeasible for large datasets. The t-ICA and PCA curves overlap, as both models are relatively insensitive to sample size. Unsupervised and supervised ivis exhibited similar runtimes. The relation between runtime and feature size, with a sample size of 10,000, is shown in Figure 10B. t-ICA and t-SNE show similar runtime growth: they are fast at small feature sizes (≤ 10,000) but impractical at higher dimensions. While both ivis models are slower than PCA, their runtimes remain acceptable for large sample sizes and high dimensions. The training process of supervised ivis is further displayed in Figure 11: the triplet loss plateaued after 4 epochs, and training stopped at 32 epochs with an early stopping patience of 10.
Figure 10:
Computation time spent by each dimensionality reduction method fitting high-dimensional data. (A) Runtime for a feature size of 1,000 with different sample sizes. The PCA and t-ICA results overlap because of the timescale. (B) Runtime for a sample size of 10,000 with different feature sizes.
Figure 11:
Triplet loss at each epoch for the supervised ivis framework with a supervision weight of 0.1 and early stopping of 10. The model was trained on a dataset of 10,000 samples with 60,000 dimensions. The dashed grey line indicates the expected termination of training with early stopping of 5.
Discussion
As a deep learning-based algorithm, the ivis framework was originally designed for single-cell experiments to provide a new approach for visualization and explanation. In this study, ivis was applied to MD simulations of the allosteric protein AuLOV for dimensionality reduction. Evaluated against several performance criteria, ivis is demonstrated to be effective in keeping both local and global features while offering key insights into the mechanism of protein allostery.
Various dimensionality reduction methods have been used for protein systems, such as PCA, t-ICA and t-SNE. As linear methods, PCA and t-ICA aim to capture the maximum variance and autocorrelation of protein motions, respectively. Nonlinear methods such as t-SNE have been shown to be superior to linear methods in keeping the similarity between the high dimension and the low dimension.41 Nevertheless, limitations of t-SNE, such as susceptibility to system noise67 and poor performance in extracting global structure, hinder further interpretation of biological systems. Compared with these methods, ivis is outstanding in preserving distances in the low-dimensional space and can be utilized for the explanation of biological systems.
Dimensionality reduction methods have different strengths in preserving structural information and can be applied to various datasets. While there is no universal standard for measuring the performance of different methods, an appropriate method should reflect the distance and similarity between projections in the low-dimensional space: similar structures in the high-dimensional space should lie close together in the low-dimensional space. This criterion is important for the construction of Markov state models, which require clustering discrete microstates on the projections. Improper projections lead to poor MSMs, obscuring the protein motions and hindering further structure-function study.68 Moreover, an adequate MSM requires similarity between the structures within each microstate. To evaluate the effectiveness of an MSM, the average RMSD value is often used as a good indicator for dimensionality reduction methods. In this regard, both unsupervised and supervised ivis are suitable for building MSMs in low-dimensional space. The estimated relaxation timescale reflects the number of steady states and is used to construct kinetically stable macrostates. The timescale of protein motion ranges from milliseconds to seconds in experiments. Among all tested dimensionality reduction methods, the ivis framework showed the longest timescale, over $10^{-5}$ seconds. This experimentally meaningful timescale, combined with the average RMSDs, indicates the success of ivis in MSM construction.
It is expected that Euclidean distances between data points in the high-dimensional space are proportional to the distances between the projected points in the low-dimensional space. In the current study, a long distance in the original space represents a high degree of dissimilarity in protein structure, and the two corresponding data points are more likely to belong to different protein folding states. A well-behaved dimensionality reduction method should keep this correspondence in the low-dimensional space, and several assessments were applied to quantify it. Spearman's rank-order correlation coefficient was calculated to test how well the ranking of pairwise distances between data points is preserved. A potential problem is that the distances are not independent; rather, a change in the position of one residue changes the n − 1 pairwise distances involving that residue. To overcome this problem, the Mantel test was used to randomly permute the rows and columns of the distance matrix. The Mantel test results showed a trend similar to the Spearman correlation coefficients, indicating that all methods are robust to the dependency among distances and maintain good stability. The concept of Shannon information from information theory was utilized to compare the information content in each projection space. The results of the above criteria show that ivis is capable of effectively separating different classes in the low-dimensional space and preserving high-dimensional distances with the least information loss. Although high-dimensional datasets are usually projected onto a 2D surface, the effectiveness of MSMs at different numbers of dimensions was also tested. The GMRQ results varied across methods, suggesting that the suitable number of dimensions depends on both the biological system and the dimensionality reduction method. However, a 2D space is still desirable for visualization purposes if it can represent sufficient biological information.
The protein contact map demonstrates the superiority of the ivis dimensionality reduction method in retaining both local and global information. Unexpectedly, the asymmetric nature of the AuLOV dimer is revealed by comparing the protein-protein interactions. Several important residues were identified by the ivis framework. Met313, Leu331 and Cys351 have been reported as light-induced rotamers near the cofactor FMN.22 These key residues are located on the surface of the β-sheet, which is consistent with and supports the signaling-mechanism concept that signals originating from the core of the Per-ARNT-Sim (PAS) domain generate conformational changes mainly within the β-sheet.63,64 Gln365 is important for the stability of the Jα helix through hydrogen bonding with Cys316.65 Leu248, Gln250 and Asn251, reported as part of the A'α linker, were also found to be important in modulating allostery within a single chain, while Asn329 and Gln350 function as FMN stabilizers.66 Through AuLOV dimerization, the A'α and Jα helices undergo conformational changes and are expected to account for a large share of the importance, as shown in Table 2. However, the protein linkers, as well as the Cα helix and the Hβ and Iβ strands, also showed high importance. The significance of the protein linkers found in the current study is consistent with both experimental and computational findings69–72 that protein linkers are indispensable components of allostery and biological function. Together, these unexpected structures are vital in AuLOV allostery and worth further study. Overall, the key residues and secondary structures identified by the ivis framework agree with experimental findings, which consolidates the good performance of ivis in elucidating the protein allosteric process.
Computational cost should be considered when comparing dimensionality reduction methods, since analysis can be computationally expensive for large datasets, especially for proteins. From this perspective, the different models were benchmarked using a dummy dataset. The results showed that PCA requires the least computational resources and is largely insensitive to both sample size and feature size. This is likely because the PCA implementation in Scikit-learn uses SVD for acceleration; further, since the dataset was large, randomized truncated SVD was applied to reduce the time complexity to $O(n_{\max}^2 \cdot n_{\mathrm{components}})$ with $n_{\max} = \max(n_{\mathrm{samples}}, n_{\mathrm{features}})$.73 While t-SNE is comparable with ivis on several assessments, its computational cost can be prohibitively expensive for large datasets, as t-SNE has a time complexity of $O(DN^2)$,74 where N and D are the numbers of samples and features, respectively. Though tree-based algorithms have been developed to reduce the complexity to $O(DN \log N)$,75 t-SNE remains challenging for high-dimensional protein systems. ivis exhibited less computational cost at higher sample sizes and dimensions. Further, as shown in Figure 11, the loss of the ivis model converges quickly, and the overall computational cost could have been reduced further with fewer early stopping iterations. Combining the performance criteria and the runtime comparison, the ivis framework is demonstrated to be a superior dimensionality reduction method for protein systems and can be an important member of the analysis toolbox for MD trajectories.
Conclusion
Originally developed for single-cell technology, the ivis framework is applied in this study as a dimensionality reduction method for molecular dynamics simulations of biological macromolecules. Across a series of performance assessments, ivis proved superior to other dimensionality reduction methods in several aspects: preserving both local and global distances, maintaining the similarity between data points in the high-dimensional space and their projections, and retaining the most structural information. ivis also shows great potential for interpreting biological systems through the feature weights in the neural network layer. Overall, ivis strikes a balance between dimensionality reduction performance and computational cost and is therefore promising as an effective tool for the analysis of macromolecular simulations.
Acknowledgement
Research reported in this paper was supported by the National Institute of General Medical Sciences of the National Institutes of Health under Award No. R15GM122013. Computational time was generously provided by Southern Methodist University’s Center for Research Computing. The authors thank Miss Xi Jiang from Biostatistics Ph.D. program in the Statistics department of SMU for fruitful discussions.
References
- (1) Lindorff-Larsen K; Piana S; Dror RO; Shaw DE How Fast-Folding Proteins Fold. Science 2011, 334, 517–520.
- (2) Shaw DE; Dror RO; Salmon JK; Grossman J; Mackenzie KM; Bank JA; Young C; Deneroff MM; Batson B; Bowers KJ, et al. Millisecond-Scale Molecular Dynamics Simulations on Anton. Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, 2009; pp 1–11.
- (3) Eastman P; Friedrichs MS; Chodera JD; Radmer RJ; Bruns CM; Ku JP; Beauchamp KA; Lane TJ; Wang L-P; Shukla D, et al. OpenMM 4: A Reusable, Extensible, Hardware Independent Library for High Performance Molecular Simulation. J. Chem. Theory Comput. 2013, 9, 461–469.
- (4) Indyk P; Motwani R Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, 1998; pp 604–613.
- (5) Ichiye T; Karplus M Collective Motions in Proteins: A Covariance Analysis of Atomic Fluctuations in Molecular Dynamics and Normal Mode Simulations. Proteins: Struct., Funct., Bioinf. 1991, 11, 205–217.
- (6) Zhou H; Dong Z; Verkhivker G; Zoltowski BD; Tao P Allosteric Mechanism of the Circadian Protein Vivid Resolved Through Markov State Model and Machine Learning Analysis. PLoS Comput. Biol. 2019, 15, e1006801.
- (7) Freddolino PL; Liu F; Gruebele M; Schulten K Ten-Microsecond Molecular Dynamics Simulation of a Fast-Folding WW Domain. Biophys. J. 2008, 94, L75–L77.
- (8) Roweis ST; Saul LK Nonlinear Dimensionality Reduction by Locally Linear Embedding. Science 2000, 290, 2323–2326.
- (9) Tenenbaum JB; De Silva V; Langford JC A Global Geometric Framework for Nonlinear Dimensionality Reduction. Science 2000, 290, 2319–2323.
- (10) Levy R; Srinivasan A; Olson W; McCammon J Quasi-Harmonic Method for Studying Very Low Frequency Modes in Proteins. Biopolymers 1984, 23, 1099–1112.
- (11) Naritomi Y; Fuchigami S Slow Dynamics in Protein Fluctuations Revealed by Time-Structure Based Independent Component Analysis: The Case of Domain Motions. J. Chem. Phys. 2011, 134, 02B617.
- (12) Maaten L. v. d.; Hinton G Visualizing Data Using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605.
- (13) Hinton GE; Salakhutdinov RR Reducing the Dimensionality of Data with Neural Networks. Science 2006, 313, 504–507.
- (14) Ferguson AL; Panagiotopoulos AZ; Kevrekidis IG; Debenedetti PG Nonlinear Dimensionality Reduction in Molecular Simulation: The Diffusion Map Approach. Chem. Phys. Lett. 2011, 509, 1–11.
- (15) Zhao X; Kaufman A Multi-Dimensional Reduction and Transfer Function Design Using Parallel Coordinates. International Symposium on Volume Graphics, 2010; p 69.
- (16) Suarez E; Adelman JL; Zuckerman DM Accurate Estimation of Protein Folding and Unfolding Times: Beyond Markov State Models. J. Chem. Theory Comput. 2016, 12, 3473–3481.
- (17) Doerr S; Ariz-Extreme I; Harvey MJ; De Fabritiis G Dimensionality Reduction Methods for Molecular Simulations. arXiv preprint arXiv:1710.10629, 2017.
- (18) Szubert B; Cole JE; Monaco C; Drozdov I Structure-Preserving Visualisation of High Dimensional Single-Cell Datasets. Sci. Rep. 2019, 9, 1–10.
- (19) Koch G; Zemel R; Salakhutdinov R Siamese Neural Networks for One-Shot Image Recognition. ICML Deep Learning Workshop, 2015.
- (20) Hermans A; Beyer L; Leibe B In Defense of the Triplet Loss for Person Re-Identification. arXiv preprint arXiv:1703.07737, 2017.
- (21) Takahashi F; Yamagata D; Ishikawa M; Fukamatsu Y; Ogura Y; Kasahara M; Kiyosue T; Kikuyama M; Wada M; Kataoka H AUREOCHROME, A Photoreceptor Required for Photomorphogenesis in Stramenopiles. Proc. Natl. Acad. Sci. U. S. A. 2007, 104, 19625–19630.
- (22) Heintz U; Schlichting I Blue Light-Induced LOV Domain Dimerization Enhances the Affinity of Aureochrome 1a for Its Target DNA Sequence. eLife 2016, 5, e11860.
- (23) Losi A; Gärtner W Bacterial Bilin- and Flavin-Binding Photoreceptors. Photochem. Photobiol. Sci. 2008, 7, 1168–1178.
- (24) Crosson S; Rajagopal S; Moffat K The LOV Domain Family: Photoresponsive Signaling Modules Coupled to Diverse Output Domains. Biochemistry 2003, 42, 2–10.
- (25) Berman HM; Bhat TN; Bourne PE; Feng Z; Gilliland G; Weissig H; Westbrook J The Protein Data Bank and the Challenge of Structural Genomics. Nat. Struct. Biol. 2000, 7, 957–959.
- (26) Freddolino PL; Gardner KH; Schulten K Signaling Mechanisms of LOV Domains: New Insights From Molecular Dynamics Studies. Photochem. Photobiol. Sci. 2013, 12, 1158–1170.
- (27) Jorgensen WL; Chandrasekhar J; Madura JD; Impey RW; Klein ML Comparison of Simple Potential Functions for Simulating Liquid Water. J. Chem. Phys. 1983, 79, 926–935.
- (28) Buchenberg S; Knecht V; Walser R; Hamm P; Stock G Long-Range Conformational Transition of a Photoswitchable Allosteric Protein: Molecular Dynamics Simulation Study. J. Phys. Chem. B 2014, 118, 13468–13476.
- (29) Tsuchiya Y; Taneishi K; Yonezawa Y Autoencoder-Based Detection of Dynamic Allostery Triggered by Ligand Binding Based on Molecular Dynamics. J. Chem. Inf. Model. 2019, 59, 4043–4051.
- (30) Cerutti DS; Duke R; Freddolino PL; Fan H; Lybrand TP A Vulnerability in Popular Molecular Dynamics Packages Concerning Langevin and Andersen Dynamics. J. Chem. Theory Comput. 2008, 4, 1669–1680.
- (31) Ben-David M; Huang H; Sun MG; Corbi-Verge C; Petsalaki E; Liu K; Gfeller D; Garg P; Tempel W; Sochirca I, et al. Allosteric Modulation of Binding Specificity by Alternative Packing of Protein Cores. J. Mol. Biol. 2019, 431, 336–350.
- (32) Braun E; Gilmer J; Mayes HB; Mobley DL; Monroe JI; Prasad S; Zuckerman DM Best Practices for Foundations in Molecular Simulations [Article v1.0]. Living J. Comput. Mol. Sci. 2019, 1.
- (33) Essmann U; Perera L; Berkowitz ML; Darden T; Lee H; Pedersen LG A Smooth Particle Mesh Ewald Method. J. Chem. Phys. 1995, 103, 8577–8593.
- (34) Eastman P; Pande V OpenMM: A Hardware-Independent Framework for Molecular Simulations. Comput. Sci. Eng. 2010, 12, 34–39.
- (35) Brooks BR; Brooks III CL; Mackerell AD Jr; Nilsson L; Petrella RJ; Roux B; Won Y; Archontis G; Bartels C; Boresch S, et al. CHARMM: The Biomolecular Simulation Program. J. Comput. Chem. 2009, 30, 1545–1614.
- (36) Foloppe N; MacKerell AD, Jr All-Atom Empirical Force Field for Nucleic Acids: I. Parameter Optimization Based on Small Molecule and Condensed Phase Macromolecular Target Data. J. Comput. Chem. 2000, 21, 86–104.
- (37) Tian H; Tao P Deciphering the Protein Motion of S1 Subunit in SARS-CoV-2 Spike Glycoprotein Through Integrated Computational Methods. arXiv preprint arXiv:2004.05256, 2020.
- (38) Nair V; Hinton GE Rectified Linear Units Improve Restricted Boltzmann Machines. Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010; pp 807–814.
- (39) Klambauer G; Unterthiner T; Mayr A; Hochreiter S Self-Normalizing Neural Networks. Advances in Neural Information Processing Systems, 2017; pp 971–980.
- (40) Golub GH; Reinsch C Linear Algebra; Springer, 1971; pp 134–151.
- (41) Zhou H; Wang F; Tao P t-Distributed Stochastic Neighbor Embedding Method with the Least Information Loss for Macromolecular Simulations. J. Chem. Theory Comput. 2018, 14, 5499–5510.
- (42) Ulyanov D Multicore-TSNE. https://github.com/DmitryUlyanov/Multicore-TSNE, 2016; access date: March 17, 2020.
- (43) Benesty J; Chen J; Huang Y; Cohen I Noise Reduction in Speech Processing; Springer, 2009; pp 1–4.
- (44) Adler J; Parmryd I Quantifying Colocalization by Correlation: The Pearson Correlation Coefficient Is Superior to the Mander's Overlap Coefficient. Cytometry, Part A 2010, 77, 733–742.
- (45) Diniz-Filho JAF; Soares TN; Lima JS; Dobrovolski R; Landeiro VL; Telles M. P. d. C.; Rangel TF; Bini LM Mantel Test in Population Genetics. Genet. Mol. Biol. 2013, 36, 475–485.
- (46) Carr JW MantelTest. https://github.com/jwcarr/MantelTest, 2013; access date: April 5, 2020.
- (47) McGibbon RT; Schwantes CR; Pande VS Statistical Model Selection for Markov Models of Biomolecular Dynamics. J. Phys. Chem. B 2014, 118, 6475–6481.
- (48) Harrigan MP; Sultan MM; Hernández CX; Husic BE; Eastman P; Schwantes CR; Beauchamp KA; McGibbon RT; Pande VS MSMBuilder: Statistical Models for Biomolecular Dynamics. Biophys. J. 2017, 112, 10–15.
- (49) McGibbon RT; Pande VS Variational Cross-Validation of Slow Dynamical Modes in Molecular Kinetics. J. Chem. Phys. 2015, 142, 03B621_1.
- (50) Liaw A; Wiener M, et al. Classification and Regression by randomForest. R News 2002, 2, 18–22.
- (51) Pedregosa F; Varoquaux G; Gramfort A; Michel V; Thirion B; Grisel O; Blondel M; Prettenhofer P; Weiss R; Dubourg V, et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
- (52) Hecht-Nielsen R Neural Networks for Perception; Elsevier, 1992; pp 65–93.
- (53) Kingma DP; Ba J Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980, 2014.
- (54) Chollet F, et al. Keras. https://keras.io, 2015; access date: April 1, 2020.
- (55) Pande VS; Beauchamp K; Bowman GR Everything You Wanted to Know About Markov State Models but Were Afraid to Ask. Methods 2010, 52, 99–105.
- (56) Bowman GR; Beauchamp KA; Boxer G; Pande VS Progress and Challenges in the Automated Construction of Markov State Models for Full Protein Systems. J. Chem. Phys. 2009, 131, 124101.
- (57) Chodera JD; Singhal N; Pande VS; Dill KA; Swope WC Automatic Discovery of Metastable States for the Construction of Markov Models of Macromolecular Conformational Dynamics. J. Chem. Phys. 2007, 126, 04B616.
- (58) Schubert E; Gertz M Intrinsic t-Stochastic Neighbor Embedding for Visualization and Outlier Detection. International Conference on Similarity Search and Applications, 2017; pp 188–203.
- (59) Zhou Y; Sharpee T Using Global t-SNE to Preserve Inter-Cluster Data Structure. bioRxiv 2018, 331611.
- (60) Wang F; Zhou H; Wang X; Tao P Dynamical Behavior of β-Lactamases and Penicillin-Binding Proteins in Different Functional States and Its Potential Role in Evolution. Entropy 2019, 21, 1130.
- (61) Wang F; Shen L; Zhou H; Wang S; Wang X; Tao P Machine Learning Classification Model for Functional Binding Modes of TEM-1 β-Lactamase. Front. Mol. Biosci. 2019, 6, 47.
- (62) Romano P; Guenza M GRadient Adaptive Decomposition (GRAD) Method: Optimized Refinement Along Macrostate Borders in Markov State Models. J. Chem. Inf. Model. 2017, 57, 2729–2740.
- (63) Zoltowski BD; Schwerdtfeger C; Widom J; Loros JJ; Bilwes AM; Dunlap JC; Crane BR Conformational Switching in the Fungal Light Sensor Vivid. Science 2007, 316, 1054–1057.
- (64) Möglich A; Ayers RA; Moffat K Structure and Signaling Mechanism of Per-ARNT-Sim Domains. Structure 2009, 17, 1282–1294.
- (65) Herman E; Sachse M; Kroth PG; Kottke T Blue-Light-Induced Unfolding of the Jα Helix Allows for the Dimerization of Aureochrome-LOV From the Diatom Phaeodactylum tricornutum. Biochemistry 2013, 52, 3094–3101.
- (66) Arinkin V; Granzin J; Röllen K; Krauss U; Jaeger K-E; Willbold D; Batra-Safferling R Structure of a LOV Protein in Apo-State and Implications for Construction of LOV-Based Optical Tools. Sci. Rep. 2017, 7, 42971.
- (67) Amid E; Warmuth MK A More Globally Accurate Dimensionality Reduction Method Using Triplets. arXiv preprint arXiv:1803.00854, 2018.
- (68) Prinz J-H; Wu H; Sarich M; Keller B; Senne M; Held M; Chodera JD; Schütte C; Noé F Markov Models of Molecular Kinetics: Generation and Validation. J. Chem. Phys. 2011, 134, 174105.
- (69) Ma B; Tsai C-J; Haliloğlu T; Nussinov R Dynamic Allostery: Linkers Are Not Merely Flexible. Structure 2011, 19, 907–917.
- (70) Reddy Chichili VP; Kumar V; Sivaraman J Linkers in the Structural Biology of Protein-Protein Interactions. Protein Sci. 2013, 22, 153–167.
- (71) George RA; Heringa J An Analysis of Protein Domain Linkers: Their Classification and Role in Protein Folding. Protein Eng., Des. Sel. 2002, 15, 871–879.
- (72) Gokhale RS; Khosla C Role of Linkers in Communication Between Protein Modules. Curr. Opin. Chem. Biol. 2000, 4, 22–27.
- (73) Halko N; Martinsson P-G; Tropp JA Finding Structure with Randomness: Stochastic Algorithms for Constructing Approximate Matrix Decompositions. 2009.
- (74) Ding J; Condon A; Shah SP Interpretable Dimensionality Reduction of Single Cell Transcriptome Data with Deep Generative Models. Nat. Commun. 2018, 9, 1–13.
- (75) Van Der Maaten L Accelerating t-SNE Using Tree-Based Algorithms. J. Mach. Learn. Res. 2014, 15, 3221–3245.