Using Markov state models to study self-assembly

Matthew R Perkett; Michael F Hagan

doi:10.1063/1.4878494

2014 Jun 2;140(21):214101. doi: 10.1063/1.4878494

PMCID: PMC4048447 PMID: 24907984

Abstract

Markov state models (MSMs) have been demonstrated to be a powerful method for computationally studying intramolecular processes such as protein folding and macromolecular conformational changes. In this article, we present a new approach to construct MSMs that is applicable to modeling a broad class of multi-molecular assembly reactions. Distinct structures formed during assembly are distinguished by their undirected graphs, which are defined by strong subunit interactions. Spatial inhomogeneities of free subunits are accounted for using a recently developed Gaussian-based signature. Simplifications to this state identification are also investigated. The feasibility of this approach is demonstrated on two different coarse-grained models for virus self-assembly. We find good agreement between the dynamics predicted by the MSMs and long, unbiased simulations, and that the MSMs can reduce overall simulation time by orders of magnitude.

INTRODUCTION

The assembly of basic units into structures with increased size and complexity is central to biology, where examples of assembled structures include viruses (e.g., Refs. ¹,²,³,⁴), cell membranes, cytoskeletal filaments,⁵ and ordered layers of proteins on bacterial surfaces.⁶ Assembly is also increasingly important to nanoscience, where interactions between colloidal particles are being engineered to drive assembly into sophisticated, functional materials (e.g., Refs. ⁷,⁸,⁹,¹⁰,¹¹,¹²,¹³,¹⁴,¹⁵) and DNA origami promises the ability to build structures of nearly limitless complexity (e.g., Refs. ¹⁶,¹⁷,¹⁸). An important focus of current research is understanding how the interactions between individual components determine assembly pathways, timescales, and fidelity for a target structure. Computational modeling can play a key role in determining assembly pathways and mechanisms, since most intermediates are transient and thus not readily characterized in experiments. However, simulating assembly is challenging because target structures can be orders of magnitude larger than their constituent components and assembly pathways typically surmount large free energy barriers, leading to timescales which greatly exceed computational limitations.

This paper is concerned with using Markov State Models (MSMs) to overcome the gap between assembly times and computationally accessible timescales. Many powerful enhanced sampling techniques have been developed to efficiently harvest computational trajectories that include barrier crossings or other rare events (e.g., Refs. ¹⁹,²⁰,²¹,²²,²³,²⁴,²⁵,²⁶,²⁷,²⁸,²⁹,³⁰,³¹,³²,³³,³⁴). However, many of these methods are limited by requiring a priori knowledge of reaction coordinates, one or few pathways to completion, or only several metastable minima. In contrast, MSMs can be used to study assembly reactions characterized by multiple free energy barriers, a diverse ensemble of pathways, and pleomorphic products. Furthermore, MSMs are one of only a few methods³⁵ that can describe non-stationary, out-of-equilibrium dynamical processes.

Previous works on assembly of capsids or other structures have postulated Markov state models in which the state space and transition rates are pre-assumed based on physical considerations.³⁶,³⁷,³⁸,³⁹,⁴⁰,⁴¹,⁴²,⁴³,⁴⁴,⁴⁵ In the approach described here, the state space and transition rates emerge from particle-based dynamics simulations, and the validity of the Markov assumption is explicitly tested against the microscopic dynamics.

While such MSMs have been extensively developed in the context of protein folding,⁴⁶,⁴⁷,⁴⁸,⁴⁹,⁵⁰,⁵¹,⁵²,⁵³,⁵⁴,⁵⁵,⁵⁶,⁵⁷,⁵⁸,⁵⁹,⁶⁰ existing approaches cannot describe the assembly of disconnected, permutable subunits. Here, we present a method to construct MSMs that is applicable to a wide variety of such assembly reactions. We test our approach on two models, which respectively describe the assembly of viral proteins around rigid nanoparticles and flexible polymers. While straightforward dynamics simulations with similar models have led to important insights about capsid assembly (e.g., Refs. ⁶¹,⁶²,⁶³,⁶⁴,⁶⁵,⁶⁶,⁶⁷,⁶⁸,⁶⁹,⁷⁰,⁷¹,⁷²,⁷³,⁷⁴,⁷⁵,⁷⁶, reviewed in Refs. ⁴,⁷⁷), these investigations have been limited to parameters for which nucleation barriers are small. We show that the MSM approach enables simulation over a much wider range of experimentally relevant parameter values.

METHODS

Existing implementations of MSMs

In this section, we review how relatively short, unbiased simulations can be used to build an MSM and how this technique has been applied to study protein folding or conformational transitions. The procedure begins by partitioning configurations from the short simulations into states such that conformations which interconvert rapidly are collected into the same state. The separation of timescales resulting from this partitioning ensures that the model is Markovian on timescales longer than a “lag time” τ, meaning that the probability of transitioning to a new state only depends upon the current state. Taking $\vec{P} (0)$ to be the vector of probabilities of being in each of the possible states of the system at time t = 0, the state probability at time t_f is given by $\vec{P} (t_{f}) = T {(τ)}^{n} \vec{P} (0)$ where n = t_f/τ and $T (τ)$ is the stochastic matrix of interstate transition probabilities estimated from the simulations at lag time τ.
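As an illustration of the propagation rule $\vec{P} (t_{f}) = T {(τ)}^{n} \vec{P} (0)$, the following minimal Python sketch applies a column-stochastic matrix n times to an initial distribution. The three-state matrix and the names `propagate`, `T_demo`, and `p0_demo` are hypothetical and chosen only for illustration; they are not taken from the models studied here.

```python
import numpy as np

def propagate(T, p0, n_steps):
    """Return T^n p0 for a column-stochastic T, where T[j, i] = P(i -> j) per lag time."""
    T = np.asarray(T, dtype=float)
    p0 = np.asarray(p0, dtype=float)
    if not np.allclose(T.sum(axis=0), 1.0):
        raise ValueError("each column of T must sum to 1")
    return np.linalg.matrix_power(T, n_steps) @ p0

# Hypothetical 3-state example (illustration only); state 2 is absorbing.
T_demo = np.array([[0.90, 0.10, 0.00],
                   [0.10, 0.80, 0.00],
                   [0.00, 0.10, 1.00]])
p0_demo = np.array([1.0, 0.0, 0.0])      # all probability starts in state 0
p_after_50 = propagate(T_demo, p0_demo, 50)  # distribution after 50 lag times
```

Because each application of T advances the system by one lag time, the model only resolves times that are integer multiples of τ.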

Determining a state decomposition that achieves the separation of timescales described above is a crucial aspect of building an MSM. If the states do not sufficiently distinguish values of all of the slow degrees of freedom in the system, then the lag time τ at which the system becomes Markovian will be comparable to its longest relaxation timescale. Since simulations must be greater than τ in length, the “short” simulations will approach the length of long, unbiased trajectories and thus the method will offer no computational savings.

Several approaches to determining state decompositions have been developed in the context of all-atom protein simulations.⁴⁶,⁴⁷,⁴⁸,⁴⁹,⁵⁰,⁵¹,⁵²,⁵³,⁵⁴,⁵⁵,⁵⁶,⁵⁷,⁵⁸,⁵⁹,⁶⁰,⁷⁸,⁷⁹,⁸⁰,⁸¹,⁸²,⁸³,⁸⁴,⁸⁵,⁸⁶,⁸⁷,⁸⁸,⁸⁹,⁹⁰,⁹¹,⁹²,⁹³ In cases where a set of collective coordinates describing all of the slow degrees of freedom is known (or guessed) a priori, biased sampling can be used to determine the free energy landscape as a function of these coordinates. States can then be defined based on local free energy minima (e.g., Refs. ⁴⁹ and ⁷⁸,⁷⁹,⁸⁰,⁸¹). Since it is rare to have a priori knowledge of good collective coordinates, alternative approaches have been developed in which configurations are clustered based on geometric criteria, such as structural similarity (e.g., Refs. ⁸²,⁸³,⁸⁴,⁸⁵). Chodera et al.⁵⁹ developed an algorithm to refine an initial geometric partitioning of “microstates” into “macrostates” based on kinetics. Open source software packages such as MSMBuilder⁸⁶,⁸⁷ and EMMA⁹⁴ provide a suite of tools for building MSMs in this fashion and analyzing them. This algorithm and similar approaches have been shown to be extremely powerful for the study of proteins, allowing for prediction of folding pathways and rates on even the supra-millisecond timescale (e.g., Refs. ⁵⁴ and ⁸⁷,⁸⁸,⁸⁹,⁹⁰) as well as identifying hidden allosteric sites.⁹¹ Recently, systematic approaches to find optimal coordinates for constructing MSMs have been developed.⁹²,⁹³

Building MSMs for assembly systems

In contrast to protein folding, where each residue in the protein has a unique index, assembly subunits are permutable and thus cannot be indexed. Therefore, existing algorithms for determining state decompositions cannot be directly applied to self-assembly. Here, we describe several approaches to creating a state decomposition based on the network of subunit interactions and their positions relative to heterogeneous nucleation sites.

We consider systems in which subunits can assemble either through homogeneous nucleation to form a single component structure (e.g., an empty virus capsid shell) or through heterogeneous nucleation around a scaffold (e.g., a polymer or a nanoparticle) to form a multicomponent structure. To simplify the presentation, we consider one type of subunit and focus on only the largest assembled structure in the system at any given time, but the approach can be generalized to multiple subunit species and assemblies. Our approach can be applied to systems which assemble via reversible or irreversible interactions; in both cases, we will describe a strong interaction as a “bond.”

To generate a state decomposition, we categorize subunits into three classes: class I: bonded subunits in the assemblage, class II: subunits bonded to the scaffold, but not the assemblage, and class III: unbonded, free subunits. In the case of homogeneous nucleation, there are no class II subunits (since there is no scaffold), and class III subunits can be ignored if subunit association to the scaffold is sufficiently reaction-rate limited that the density of free subunits is spatially uniform.

For systems that assemble into well-defined structures, fluctuations in bond distances and angles are fast compared to bond formation and breakage. By averaging over these short timescale fluctuations, the unique structure of a growing cluster can be defined by the class I subunit bonding network. More precisely, each cluster is converted into an undirected graph with nodes corresponding to subunits and edges corresponding to bonds between the subunits (see Fig. 11 in Appendix B). This ensures a consistent state definition that is unaffected by exchanges of subunits with different indices, short-timescale conformational fluctuations, or rigid body motions of the assemblage. Class I subunits can be further sub-partitioned by including the distance from the scaffold to each subunit, but this was unnecessary for the systems that we considered. Class II subunits can be handled in a similar manner by considering the subunit-scaffold bonding network, but we found that it was only necessary to track the total number of class II subunits.

(a) and (c) show clusters of patchy spheres growing on the nanoparticle during assembly while (b) and (d) show their respective graph representations. Attractors are not shown, but bonds (strong attractions) between subunits are depicted as teal cylinders. In this work, only the largest cluster was considered when constructing the MSM, which would exclude the lower dimer in (c) and (d) from consideration.
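The idea of an index-free state label for the largest cluster can be sketched in Python as follows. This is a simplified illustration, not the graph coordinate used in this work: it assumes that bonds arrive as pairs of arbitrary subunit ids, and it summarizes the largest connected cluster by its size, its bond count, and its sorted degree sequence. Such a signature is unchanged by relabeling subunits, but it is not a complete isomorphism test, so distinct graphs can in principle share a signature. The function names are hypothetical.

```python
from collections import defaultdict

def largest_cluster(bonds):
    """Return the set of subunit ids in the largest connected cluster."""
    neighbors = defaultdict(set)
    for a, b in bonds:
        neighbors[a].add(b)
        neighbors[b].add(a)
    seen, best = set(), set()
    for start in neighbors:
        if start in seen:
            continue
        # Depth-first search over the bonding network from this subunit.
        stack, component = [start], {start}
        while stack:
            node = stack.pop()
            for nxt in neighbors[node]:
                if nxt not in component:
                    component.add(nxt)
                    stack.append(nxt)
        seen |= component
        if len(component) > len(best):
            best = component
    return best

def cluster_signature(bonds):
    """Index-free label: (size, bond count, sorted degree sequence) of the largest cluster."""
    cluster = largest_cluster(bonds)
    inside = {frozenset(b) for b in bonds if b[0] in cluster and b[1] in cluster}
    degree = defaultdict(int)
    for bond in inside:
        for node in bond:
            degree[node] += 1
    return (len(cluster), len(inside), tuple(sorted(degree.values())))

# A bonded trimer plus a separate dimer; only the trimer enters the label.
signature = cluster_signature([(1, 2), (2, 3), (3, 1), (7, 8)])
```

Where two intermediates with the same signature must be distinguished, a full graph isomorphism check would be needed in place of this summary.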

Class III subunits must be included in the state decomposition when their association to the cluster or scaffold approaches the diffusion limit, which results in density inhomogeneities. Free subunit positions fluctuate rapidly since they are not involved in strong interactions, and thus we use their density distribution rather than their positions to decompose states. We follow the approach developed by Gu et al.⁹⁵ to include solvent degrees of freedom in protein folding MSMs. We define a vector w, in which each index corresponds to a subunit in the cluster or a residue of the scaffold and each component corresponds to a distance-weighted density of free subunits around the indexed subunit,

$$w_{i} \equiv \sum_{j \in \text{free subunits}} e^{- | {\vec{r}}_{j} - {\vec{r}}_{i} |^{2} / σ_{d}} \tag{1}$$

with ${\vec{r}}_{i}$ and ${\vec{r}}_{j}$ as the positions of cluster/scaffold subunit i and the free subunit j. σ_d is an adjustable decay length that sets the scale for relevant interactions between free subunits and the scaffold or growing cluster. This definition weighs nearby subunits more heavily since they are more likely to associate with the cluster or scaffold. Whether cluster subunits, scaffold residues, or both need to be considered in this definition depends on which association reactions approach the diffusion limit. For example, in the case considered below where subunit adsorption onto the nanoparticle approaches the diffusion limit but subunit-subunit associations do not, it is only necessary to include the nanoparticle in w.
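A minimal NumPy sketch of Eq. 1 is given below, assuming that positions are supplied as arrays of Cartesian coordinates and that periodic boundary conditions can be ignored (an assumption made only to keep the example short). The function name `free_subunit_density` and the argument names are hypothetical; the exponent is evaluated exactly as Eq. 1 is written.

```python
import numpy as np

def free_subunit_density(site_positions, free_positions, sigma_d):
    """Evaluate w_i = sum_j exp(-|r_j - r_i|^2 / sigma_d) for every site i (Eq. 1 as written).

    site_positions: (n_sites, 3) positions of cluster subunits and/or scaffold residues.
    free_positions: (n_free, 3) positions of free subunits (at least one).
    """
    sites = np.atleast_2d(np.asarray(site_positions, dtype=float))
    free = np.atleast_2d(np.asarray(free_positions, dtype=float))
    # Pairwise squared distances with shape (n_sites, n_free); no periodic images.
    diff = sites[:, None, :] - free[None, :, :]
    dist_sq = np.sum(diff**2, axis=-1)
    return np.exp(-dist_sq / sigma_d).sum(axis=1)

# One site at the origin, one free subunit on top of it and one at distance 2.
w_demo = free_subunit_density([[0.0, 0.0, 0.0]],
                              [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]],
                              sigma_d=4.0)
```

The resulting vector w can then be discretized or clustered together with the other state coordinates; how finely to bin it is a modeling choice not fixed by Eq. 1.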

Finally, if scaffold internal degrees of freedom (e.g., conformations of a polymer) evolve slowly, these should also be included in the state definition. Because the scaffold residues (i.e., polymer segments) are indexable, these degrees of freedom can be treated via the existing RMSD-based approach (e.g., in MSMBuilder⁸⁶,⁸⁷ and EMMA⁹⁴).

Reducing the number of states

As the size of the target structure increases, the number of distinct assembly intermediates and hence the number of unique graphs grows rapidly. The number of states could become intractable if class II or class III subunits were included in the state definition. There are two routes to reduce the number of states. First, kinetic data can be used to group microstates that interconvert rapidly into macrostates by following the approach based on Perron cluster analysis in MSMBuilder⁶⁰,⁸⁶,⁸⁷ and EMMA.⁹⁴ Second, a priori knowledge of the system can be used to reduce the number of unique states. Since assembly into a target structure with high fidelity generally requires weak subunit-subunit interactions,⁴,⁹⁶,⁹⁷ subunits in a cluster with only one bond rapidly dissociate. Thus, the edges corresponding to these interactions can generally be neglected when building graphs. For the models considered here, we found that fluctuations about the most compact, highly bonded structures are rapid enough that it was sufficient to consider only the number of subunits in a cluster. Several alternative simplified descriptions are discussed in Appendix B.

Generating the transition matrix

The transition probability matrix T(τ) is calculated by column-normalizing the count matrix C(τ), in which each element C_ji gives the total number of transitions from state i to state j measured at a lag time τ. The count matrix can be calculated from many, relatively short, unbiased trajectories run in parallel. Because of the Markov property, the initial conditions for these trajectories can be chosen to efficiently generate good statistics for all of the relevant transition elements.
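The construction of C(τ) and T(τ) from discretized trajectories can be sketched as below. The sketch assumes that each trajectory has already been mapped to integer state indices saved at a fixed interval, that the lag is given as a number of saved frames (at least 1), and that transitions are counted with a sliding window; these choices and the function names are illustrative assumptions rather than a description of the exact protocol used here.

```python
import numpy as np

def count_matrix(dtrajs, n_states, lag):
    """C[j, i] = number of observed i -> j transitions separated by `lag` frames (lag >= 1)."""
    C = np.zeros((n_states, n_states))
    for traj in dtrajs:
        traj = np.asarray(traj)
        if len(traj) <= lag:
            continue  # too short to contain a transition at this lag
        # Row index is the destination state, column index is the origin state.
        np.add.at(C, (traj[lag:], traj[:-lag]), 1.0)
    return C

def transition_matrix(C):
    """Column-normalize C; columns with no counts are left as zeros."""
    totals = C.sum(axis=0, keepdims=True)
    return np.divide(C, totals, out=np.zeros_like(C), where=totals > 0)

# Two short hypothetical trajectories over three states.
C_demo = count_matrix([[0, 0, 1, 1, 2], [0, 1, 2, 2]], n_states=3, lag=1)
T_est = transition_matrix(C_demo)
```

Counts from all short trajectories are pooled before normalization, which is what allows many independent simulations started from different states to contribute to a single model.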

When no information is available a priori about which transition elements are most significant, one can use a ratcheting procedure (Appendix C). Many simulations are run in parallel for a time t_s, which must be longer than the lag time τ but can be much shorter than the longest relaxation timescale. Microstates are then determined from coordinates saved during these trajectories, and a new ensemble of trajectories is started with initial conditions preferentially chosen from the microstates with the poorest sampling. This procedure is repeated until T has satisfactorily converged. Once sufficient statistics have been gathered to crudely estimate T, more systematic adaptive sampling⁴⁷ can be used to choose initial conditions that will reduce the statistical uncertainty of the MSM. However, the initial ratcheting procedure already allows for tremendous speed up in comparison to long, brute force (without enhanced sampling) simulations as it enables the system to cross free energy barriers in linear rather than exponential simulation times. Note that because the protocol does not generate initial conditions according to the equilibrium distribution, the count matrix C should not be symmetrized when simulating assembly dynamics. In fact, even if C is estimated from long, unbiased trajectories that achieve formation of the target structure it should not be symmetrized when calculating dynamics, since assembly to the target structure is an out-of-equilibrium process.
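One simple way to choose initial conditions preferentially from poorly sampled microstates is sketched below. The weighting, probability proportional to 1/(1 + visits), is an assumption made for illustration and is not the specific rule of Appendix C; the function name `choose_start_states` is likewise hypothetical. Only states that have already been discovered can be selected.

```python
import numpy as np

def choose_start_states(visit_counts, n_new, rng=None):
    """Pick start states with probability proportional to 1 / (1 + visits).

    The inverse-count weighting is an illustrative assumption, not the rule of Appendix C.
    Returns the chosen state indices and the selection probabilities.
    """
    rng = np.random.default_rng() if rng is None else rng
    counts = np.asarray(visit_counts, dtype=float)
    weights = 1.0 / (1.0 + counts)
    probs = weights / weights.sum()
    picks = rng.choice(len(counts), size=n_new, p=probs)
    return picks, probs

# Hypothetical visit counts for four discovered states.
picks, probs = choose_start_states([1000, 100, 3, 0], n_new=8,
                                   rng=np.random.default_rng(0))
```

A saved configuration belonging to each chosen state would then seed the next round of short trajectories.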

Analysis of MSMs

In this subsection, we briefly review analysis of constructed MSMs and discuss an application which is useful for analyzing assembly reactions. Upon spectral decomposition of the transition matrix, the time-dependent state probabilities can be written as

$$\begin{aligned} \vec{P} (t; τ) &= \sum_{i = 1}^{N} | i ⟩ ⟨ i | \vec{P} (0) ⟩ e^{- λ_{i} t}, \\ λ_{i} &= - \log (ω_{i}) / τ, \end{aligned} \tag{2}$$

where ω_i is the ith eigenvalue of T(τ) and $⟨ i |$ and $| i ⟩$ are the corresponding left and right eigenvectors, which are assumed to be normalized. Since T(τ) is generally not Hermitian, the left and right eigenvectors are not equivalent. Because the transition matrix is stochastic, there is only one unit eigenvalue, whose associated right eigenvector corresponds to the equilibrium distribution, while all other eigenvalues are positive and real.⁵⁰ The implied timescale, $λ_{i}^{- 1}$, corresponds to the relaxation timescale for eigenmode i. For lag times on which the system satisfies the Markov assumption, the calculated implied timescales are nearly independent of τ.⁸⁸ Checking the convergence of the implied timescales is useful for selecting an appropriate τ, but does not guarantee that the model is Markovian, which also requires converged eigenvectors.⁸⁸
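The implied timescales $λ_{i}^{- 1} = - τ / \log (ω_{i})$ can be computed from a transition matrix as in the following sketch. It assumes real, positive eigenvalues, as stated above, and discards the unit eigenvalue; the function name and the two-state matrix are hypothetical.

```python
import numpy as np

def implied_timescales(T, lag):
    """Return 1/lambda_i = -lag / ln(omega_i) for the decaying eigenmodes of T."""
    omega = np.linalg.eigvals(np.asarray(T, dtype=float))
    omega = omega[np.argsort(-np.abs(omega))]   # largest magnitude first
    decaying = np.real(omega[1:])               # drop the unit eigenvalue
    # Keep only eigenvalues in (0, 1); this assumes a real, positive spectrum.
    decaying = decaying[(decaying > 0) & (decaying < 1)]
    return -lag / np.log(decaying)

# Hypothetical two-state column-stochastic matrix with eigenvalues 1 and 0.7.
T_two = np.array([[0.9, 0.2],
                  [0.1, 0.8]])
ts_lag1 = implied_timescales(T_two, lag=1)
ts_lag2 = implied_timescales(T_two @ T_two, lag=2)  # same dynamics viewed at twice the lag
```

For this exactly Markovian toy matrix the two estimates coincide, which is the lag-time independence described above.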

Self-assembly reactions

It is often useful to calculate the completion fraction f_c(t), which is defined as the fraction of structures in the target state as a function of time. This quantity can be compared to light scattering or size exclusion chromatography experiments.⁴³,⁹⁸,⁹⁹,¹⁰⁰,¹⁰¹,¹⁰²,¹⁰³,¹⁰⁴,¹⁰⁵ With $\vec{P}$ ordered such that index 1 corresponds to the initial, unassembled state and the largest index N corresponds to the target state, f_c is given by P_N(t). Inserting P_i(0) = δ_{i, 1}, with δ the Kronecker delta, into Eq. 2 gives

$$f_{c} (t) = \sum_{i = 1}^{N} {| i ⟩}_{N} {⟨ i |}_{1} e^{- λ_{i} t}, \tag{3}$$

where ${⟨ i |}_{n}$ indicates the nth index of $⟨ i |$. The mean completion time ${\bar{t}}_{c} = \int_{0}^{\infty} t \, \frac{d f_{c} (t)}{d t} \, d t$ then follows, since the i = 1 mode has λ_1 = 0 and does not contribute to the derivative, as

$${\bar{t}}_{c} = - \sum_{i = 2}^{N} {| i ⟩}_{N} {⟨ i |}_{1} \frac{1}{λ_{i}} . \tag{4}$$
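The spectral expression for f_c(t) and the mean completion time obtained from its time integral can be checked numerically with the sketch below. It assumes a column-stochastic T with real, positive, non-degenerate eigenvalues, takes the left eigenvectors as the rows of the inverse of the right-eigenvector matrix (so that they are biorthonormal), and evaluates the mean completion time as minus the sum of amplitude over rate for the decaying modes, which is what integrating t df_c/dt term by term gives. The three-state matrix and all function names are hypothetical.

```python
import numpy as np

def completion_modes(T, lag):
    """Return amplitudes a_i = |i>_N <i|_1 and rates lambda_i, with the lambda = 0 mode first."""
    omega, right = np.linalg.eig(np.asarray(T, dtype=float))
    left = np.linalg.inv(right)              # rows are left eigenvectors, <i|j> = delta_ij
    order = np.argsort(-omega.real)
    omega = omega.real[order]
    right = right.real[:, order]
    left = left.real[order, :]
    amplitudes = right[-1, :] * left[:, 0]   # target component N times initial component 1
    rates = -np.log(omega) / lag
    return amplitudes, rates

def fraction_complete(t, amplitudes, rates):
    """f_c(t) = sum_i a_i exp(-lambda_i t) for scalar or array t."""
    t = np.atleast_1d(np.asarray(t, dtype=float))
    return np.exp(-np.outer(t, rates)) @ amplitudes

def mean_completion_time(amplitudes, rates):
    """Integral of t * df_c/dt over t >= 0, summed over the decaying modes only."""
    return -np.sum(amplitudes[1:] / rates[1:])

# Hypothetical 3-state chain; the last state is the absorbing target.
T_demo = np.array([[0.90, 0.10, 0.00],
                   [0.10, 0.80, 0.00],
                   [0.00, 0.10, 1.00]])
amps, rates = completion_modes(T_demo, lag=1.0)
f_c_10 = fraction_complete(10.0, amps, rates)[0]
t_bar = mean_completion_time(amps, rates)
```

At integer multiples of the lag time, `fraction_complete` agrees with the corresponding element of the matrix power of T, which provides a direct consistency check on the eigenvector normalization.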

Analysis using transition path theory

Insight into assembly mechanisms can be obtained from MSMs using Transition Path Theory (TPT),¹⁰⁶,¹⁰⁷ which has been developed in the context of MSMs in Refs. ⁵⁴,⁹⁰,¹⁰⁸,¹⁰⁹. Here we state two of these results that are particularly useful for analyzing assembly reactions. The microstates that correspond to the transition state ensemble can be identified by calculating the committor probability for each state, which is the probability that a dynamical trajectory initiated from a given state will subsequently visit the target state.¹¹⁰,¹¹¹ We define A as the set of mostly unassembled states that rapidly interconvert (the reactant states), B as the target structure (the product state), and I as all other states (the intermediate states). The forward committor probability, $q_{i}^{+}$ , is the probability that a trajectory started in state i will visit B before A and is given by solving⁵⁴

$$q_{i}^{+} - \sum_{j \in I} T_{i j} q_{j}^{+} = \sum_{j \in B} T_{i j} . \tag{5}$$

Similarly, the backward committor probability $q_{i}^{-}$ is the probability that the system was more recently in state A than B. For an equilibrium system, the committor probabilities are related by⁵⁴ $q_{i}^{-} = 1 - q_{i}^{+}$ , but for the models considered here, the target structure acts as an absorbing state (see Appendix C), which gives $q_{i}^{-} = 1$ for i∉B.
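The linear system of Eq. 5 can be solved directly for the intermediate states, as in this sketch. Eq. 5 is read here with T[i, j] as the probability of an i → j transition (row-stochastic), so a column-normalized matrix would be transposed first; that reading is an assumption of the example. Boundary values q⁺ = 0 on A and q⁺ = 1 on B are imposed explicitly. The five-state random walk and the function name are hypothetical.

```python
import numpy as np

def forward_committor(T, A, B):
    """Solve Eq. 5 for q+; T[i, j] is taken as the probability of an i -> j transition."""
    T = np.asarray(T, dtype=float)
    n = T.shape[0]
    A, B = sorted(set(A)), sorted(set(B))
    I = [s for s in range(n) if s not in A and s not in B]
    q = np.zeros(n)          # q+ = 0 on the reactant set A
    q[B] = 1.0               # q+ = 1 on the product set B
    if I:
        lhs = np.eye(len(I)) - T[np.ix_(I, I)]
        rhs = T[np.ix_(I, B)].sum(axis=1)
        q[I] = np.linalg.solve(lhs, rhs)
    return q

# Hypothetical symmetric random walk on five states with reflecting ends.
T_walk = np.zeros((5, 5))
for i in range(5):
    T_walk[i, max(i - 1, 0)] += 0.5
    T_walk[i, min(i + 1, 4)] += 0.5
q_plus = forward_committor(T_walk, A=[0], B=[4])
```

In this toy chain the committor rises linearly from the reactant state to the product state, and a state with q⁺ near 0.5 would be assigned to the transition state ensemble.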

The relative probabilities of different assembly pathways can be calculated from the flux between states, which is given by⁵⁴ $f_{i j} = π_{i} q_{i}^{-} T_{i j} q_{j}^{+}$ , with π_i as the stationary probability of being found in state i. Since f_ij contains non-productive loops that are not on the pathway to completion, the forward flux $f_{i j}^{+}$ is defined by subtracting out these contributions:⁵⁴

$$f_{i j}^{+} = \max (0, f_{i j} - f_{j i}) . \tag{6}$$
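The flux expressions can be evaluated as in the sketch below, which computes f_ij for i ≠ j and then the net forward flux of Eq. 6. The example uses a hypothetical equilibrium random walk, for which $q_{i}^{-} = 1 - q_{i}^{+}$ and the stationary distribution is uniform; for the absorbing-target case described above, the backward committor would instead be set to 1 outside B. As in Eq. 5, T[i, j] is taken as the probability of an i → j transition, which is an assumption of this example.

```python
import numpy as np

def net_forward_flux(T, pi, q_forward, q_backward):
    """f_ij = pi_i q-_i T_ij q+_j for i != j, then f+_ij = max(0, f_ij - f_ji)."""
    T = np.asarray(T, dtype=float)
    weight_from = np.asarray(pi, dtype=float) * np.asarray(q_backward, dtype=float)
    flux = weight_from[:, None] * T * np.asarray(q_forward, dtype=float)[None, :]
    np.fill_diagonal(flux, 0.0)              # self-transitions carry no flux between states
    return np.maximum(0.0, flux - flux.T)

# Hypothetical symmetric random walk on five states with reflecting ends.
T_walk = np.zeros((5, 5))
for i in range(5):
    T_walk[i, max(i - 1, 0)] += 0.5
    T_walk[i, min(i + 1, 4)] += 0.5
pi_walk = np.full(5, 0.2)                    # uniform stationary distribution
q_plus = np.linspace(0.0, 1.0, 5)            # forward committor of this walk
net = net_forward_flux(T_walk, pi_walk, q_plus, 1.0 - q_plus)
```

For a single linear pathway such as this one, the net flux is the same across every step, which is the conservation property that allows pathway probabilities to be read from the net flux network.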

MODELS

To test and benchmark our MSM framework, we consider two previously studied models for viral assembly, which differ in their level of detail and more importantly in the type of cargo being packaged. Both models represent capsid protein subunits as rigid bodies with excluded volume geometries and orientation-dependent interactions (following, e.g., Refs. ⁶¹,⁶²,⁶³,⁶⁴,⁶⁵,⁶⁶,⁶⁷,⁶⁸,⁶⁹,⁷⁰,⁷¹,⁷²,⁷³,⁷⁴,⁷⁵,⁷⁶ and ¹¹²,¹¹³,¹¹⁴), designed such that the lowest energy structure is an icosahedron with 20 subunits. Each subunit can be thought of as describing a trimer of proteins that form a T = 1 capsid.⁴

Patchy sphere model

The first model is motivated by experiments in which capsid proteins from brome mosaic virus (BMV) or hepatitis B virus (HBV) assemble around nanoparticles functionalized with negative charge.¹¹⁵,¹¹⁶,¹¹⁷,¹¹⁸,¹¹⁹,¹²⁰,¹²¹,¹²²,¹²³,¹²⁴,¹²⁵,¹²⁶ Following the approach developed by Schwartz et al.,⁶¹ the subunit excluded volume is spherically symmetric and three attractive patches (bond vectors) are rigidly fixed to the subunit, with each pair of bond vectors forming an angle of 108° (see Fig. 1 and Eq. A1). There is a favorable interaction between subunits when (1) the ends of bond vectors nearly overlap, (2) the bond vectors are nearly anti-parallel, and (3) the secondary bond vectors are nearly coplanar. Twenty subunits satisfying these conditions form the minimum energy target structure (a complete capsid) shown in Fig. 1. The interaction strength is tuned by the parameter ɛ_B. The nanoparticle has a spherical excluded volume and short-range attractive interactions with capsid subunits, which are tuned by the parameter ε_S and qualitatively represent screened electrostatic attractions. More details about this model can be found in Appendix A1.

Triangles model

This model represents a capsid protein subunit using multiple, spherical “excluders” that enforce excluded volume and spherical “attractors” with short-range, pairwise attractions tuned by the parameter ɛ_cc. Excluders and attractors are arranged so that the minimum energy capsid is an icosahedron (Fig. 2). The cargo considered here is a self-avoiding bead-spring polymer with a persistence length comparable to that of single-stranded RNA (ssRNA) with no base pairing. Polymer beads experience short-range attractive interactions with polymer-attractors located on the bottom of protein subunits (see Fig. 2); the interaction strength is tuned by the parameter ɛ_cp. This model has previously been used to study assembly around a polymer⁷² and empty capsid assembly.⁷³ Similar models were applied to empty capsids by Rapaport⁶⁹,⁷⁰,⁷¹ and Nguyen et al.⁶³ More details are given in Appendix A2.

Triangle model geometry. (a) Subunit geometry with grey excluders, green polymer attractors, and teal subunit attractors. Subunits experience an attractive interaction when the subunit attractors nearly overlap. (b) A cutaway view of the complete capsid with the encapsulated polymer shown in blue. (c) The complete capsid, which contains 20 subunits.

Simulations and units

Subunit positions and orientations are propagated using overdamped Brownian dynamics according to a second order predictor-corrector algorithm.¹²⁷ To represent an experiment with excess capsid protein, each simulation includes a single scaffold (nanoparticle or polymer) and is coupled to a bulk solution by performing grand canonical Monte Carlo moves, in which subunits at the periphery of the simulation box are exchanged with a reservoir at fixed chemical potential with a frequency consistent with the diffusion-limited rate.⁷²,¹²² To obtain dimensionless units, we rescale energies by k_BT and times by a characteristic diffusion timescale (see Appendix A).

RESULTS

We performed simulations over a wide range of parameters to test the ability of the MSMs to accurately reproduce assembly dynamics and to determine the extent of computational speed up in comparison to brute force calculations. To evaluate accuracy, we compare the MSM (Eq. 3) and brute force dynamics predictions for the cumulative distribution of assembly times, f_c(t), in Figs. 3 and 4. We find that this comparison provides a stringent test of the MSM, as it requires an accurate estimate of all statistically relevant elements of the transition matrix. In particular, capturing the assembly lag phase (the time before the first target structures appear, see Fig. 3) requires accurate estimates for as many as 20 implied timescales and their associated eigenvectors.

Patchy sphere model: Fraction complete (f_c) as a function of time from the brute-force calculations (symbols) and the MSM calculations (lines). The subunit-subunit binding energy parameter is ɛ_B = 10 for all simulations and the values of the subunit-nanoparticle interaction parameter ε_S are indicated on the plot. MSM states are defined by the number of subunits adsorbed to the nanoparticle and the largest cluster size, except for ε_S = 9, which also includes Eq. 1 for density variations around the nanoparticle with σ_d = 4σ. Notice that a semi-log scale is used to accommodate a wide range of assembly times. The brute force estimates of f_c(t) are calculated from the completion times for 500–1000 unbiased simulations.

Triangles model: Fraction complete as a function of time for the brute-force calculations (symbols) and the MSM calculations (lines) for indicated values of the subunit-subunit (ɛ_cc) and subunit-polymer (ɛ_cp) interaction parameters. It was not possible to simulate assembly using brute-force calculations for ɛ_cp = 3.0. MSM states are defined by the number of subunits adsorbed to the polymer and the largest cluster size.

The assembly time distributions are accurately predicted from the MSMs for a wide range of parameter values that represent different assembly mechanisms (Figs. 3 and 4). In the patchy sphere model, assembly for the weakest subunit-nanoparticle interaction (ε_S = 6.6) is heavily nucleation-dominated, while for the strongest value (ε_S = 9) the nucleation timescale is comparable to the elongation timescale¹²⁸ (i.e., the time required for a critical nucleus to grow to completion¹²⁹). These two scenarios are distinguished by the spectrum of implied timescales (Fig. 5). MSMs corresponding to nucleation-dominated parameter sets are characterized by a wide separation between the two largest implied timescales, whereas the MSM corresponding to ε_S = 9 yields a dense spectrum of implied timescales. As discussed in Sec. 4C, the different parameter values give rise to very different assembly pathways as well.

Implied timescales $λ_{i}^{- 1}$ for the patchy sphere model. The five largest timescales are shown for (a) slow, nucleation dominated assembly (ɛ_B = 10, ε_S = 6.6) and (b) rapid assembly (ɛ_B = 10, ε_S = 9). For (a), the largest implied timescale (excluding the unit eigenvalue) corresponds to the nucleation timescale.

For most parameter sets, we found that the minimal state definition capable of reproducing assembly dynamics included the number of subunits in the largest cluster and the number of subunits adsorbed to the nanoparticle or polymer. However, it is worth noting that the ratcheting procedure used to estimate the transition matrix considered not only the cluster size, but also the number of intra-cluster bonds (Appendix C). Ratcheting was less efficient when only the cluster size was considered, which suggests that the bonding network within a cluster is important. For the parameter set with the strongest subunit-nanoparticle interaction strength (ε_S = 9 in Fig. 3), subunits rapidly adsorbed onto the nanoparticle and it was also necessary to include the free subunit density distribution (Eq. 1) when building the MSM.

Although a simple state definition yields accurate results for these models, a more detailed order parameter that describes the cluster structure will be required in other situations. To show that MSM construction is feasible even with our most general definition, we also generated MSMs using the graph coordinate (described in Appendix B). As shown in Fig. 6, the predicted assembly time distributions are identical to those predicted from the simpler coordinates.¹³⁰

Feasibility of the graph coordinate. The fraction complete for the patchy sphere model calculated from MSMs defined using the number of subunits adsorbed to the nanoparticle and either the largest cluster size (dashed line) or the graph coordinate (solid line) for indicated parameter values. Data from long, unbiased simulations is shown as points.

Testing convergence

To test that a system is Markovian on timescales corresponding to the lag time τ used to build the transition matrix, one typically checks that the implied timescales are nearly lag-time-independent for lag times equal to or exceeding τ (Fig. 5). For assembly systems, convergence can be more stringently tested by determining if the predicted assembly time distribution f_c, which depends on all of the implied timescales and associated eigenvectors, becomes independent of lag time (Fig. 7).
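A lag-time scan of the slowest implied timescale can be sketched as follows. The example generates a synthetic two-state trajectory from a known Markov chain, which is an assumption made so that the expected answer is available; it is not data from the models studied here. The function name, the switching probabilities, and the chosen lags are hypothetical.

```python
import numpy as np

def slowest_timescale(dtraj, n_states, lag):
    """Estimate T at the given lag (in frames) and return its slowest implied timescale."""
    C = np.zeros((n_states, n_states))
    np.add.at(C, (dtraj[lag:], dtraj[:-lag]), 1.0)   # C[j, i]: i -> j counts
    T = C / C.sum(axis=0, keepdims=True)             # column-normalize
    omega = np.sort(np.abs(np.linalg.eigvals(T)))[::-1]
    return -lag / np.log(omega[1])                   # omega[0] is the unit eigenvalue

# Synthetic two-state trajectory (illustration only): state s switches with p_leave[s].
rng = np.random.default_rng(0)
p_leave = np.array([0.1, 0.2])
u = rng.random(200_000)
dtraj = np.zeros(u.size, dtype=int)
for k in range(1, u.size):
    s = dtraj[k - 1]
    dtraj[k] = 1 - s if u[k] < p_leave[s] else s

scan = {lag: slowest_timescale(dtraj, 2, lag) for lag in (1, 2, 4)}
```

Because the synthetic chain is Markovian at every lag, the estimates stay flat up to statistical noise; for a real state decomposition the same scan would level off only beyond the lag time at which the Markov assumption holds.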

Simulation time

To assess the computational speed up afforded by MSMs for our systems, we calculated a scaled error Θ for the estimated mean assembly time ${\bar{t}}_{c}$ :

$$Θ (t_{T}) = \left( \frac{{\bar{t}}_{c} (t_{T}) - {\bar{t}}_{c} (t_{f})}{{\bar{t}}_{c} (t_{f})} \right)^{2} . \tag{7}$$

Here t_T is the total simulation time accrued during the short trajectories used to estimate the transition probability matrix; t_T was varied by changing the number of short trajectories. We neglected computational overhead associated with initializing short trajectories and spectral decomposition of the transition matrix, as these factors were negligible. We also calculated Θ as a function of simulation time for straightforward dynamics, with t_T varied by changing the number of trajectories used to estimate ${\bar{t}}_{c} (t_{f})$ . As shown in Fig. 7, the MSM calculation converges with an order of magnitude less simulation time than the estimate based on straightforward dynamics. As expected, the magnitude of speedup depends on the separation of timescales, with greater speed up for large nucleation barriers and only limited speed up in growth-dominated regimes with dense timescale spectra. Importantly, the method is not hindered by multiple, large nucleation barriers such as tend to occur for low values of ɛ_B in these capsid assembly systems.⁶² The MSM calculation shown for the lowest polymer-subunit interaction in Fig. 4 (ɛ_cp = 3.0) demonstrates that efficient convergence can be achieved for parameters which are inaccessible to unbiased simulations; a typical unbiased simulation at these parameters would require 2 × 10⁴ cpu hours (over 2 cpu years).