CAMELOT: A machine learning approach for coarse-grained simulations of aggregation of block-copolymeric protein sequences

Kiersten M Ruff; Tyler S Harmon; Rohit V Pappu

doi:10.1063/1.4935066

2015 Nov 9;143(24):243123. doi: 10.1063/1.4935066

PMCID: PMC4644154 PMID: 26723608

Abstract

We report the development and deployment of a coarse-graining method that is well suited for computer simulations of aggregation and phase separation of protein sequences with block-copolymeric architectures. Our algorithm, named CAMELOT for Coarse-grained simulations Aided by MachinE Learning Optimization and Training, leverages information from converged all atom simulations that is used to determine a suitable resolution and parameterize the coarse-grained model. To parameterize a system-specific coarse-grained model, we use a combination of Boltzmann inversion, non-linear regression, and a Gaussian process Bayesian optimization approach. The accuracy of the coarse-grained model is demonstrated through direct comparisons to results from all atom simulations. We demonstrate the utility of our coarse-graining approach using the block-copolymeric sequence from the exon 1 encoded sequence of the huntingtin protein. This sequence comprises 17 residues from the N-terminal end of huntingtin (N17) followed by a polyglutamine (polyQ) tract. Simulations based on the CAMELOT approach are used to show that the adsorption and unfolding of the wild type N17 and its sequence variants on the surface of polyQ tracts engender a patchy colloid-like architecture that promotes the formation of linear aggregates. These results provide a plausible explanation for experimental observations, which show that N17 accelerates the formation of linear aggregates in block-copolymeric N17-polyQ sequences. The CAMELOT approach is versatile and is generalizable for simulating the aggregation and phase behavior of a range of block-copolymeric protein sequences.

I. INTRODUCTION

Protein sequences with block-copolymeric architectures can aggregate and drive reversible liquid-liquid demixing or sol-gel transitions that give rise to dense liquid or gel-like phases.^1–17 A subset of these sequences also form amorphous or semi-crystalline solids such as amyloid fibers.^1,18–30 The sequence blocks in these proteins are either well-folded domains or intrinsically disordered regions (IDRs). The latter refer to regions that fail to fold into unique three-dimensional structures as autonomous units.^31–34 Accordingly, depending on their architecture, protein sequences that drive aggregation and phase separation may be classified as being intrinsically disordered or partially disordered block-copolymers (Figure 1). Molecular simulations can play a prominent role in uncovering the physical principles that govern the phase behavior of block-copolymeric proteins. Since aggregation and phase separation are driven by collective interactions among large numbers of polymeric molecules, the underlying physics of phase transitions and the demands of computational efficiency mandate the development and deployment of coarse-grained simulations.^35–38 Here, we report the design and implementation of a method for systematic coarse-graining that can be adapted to the two classes of block-copolymeric protein sequences depicted in Figure 1. The algorithm is named CAMELOT and it stands for Coarse-grained simulations Aided by MachinE Learning Optimization and Training. We demonstrate the utility of the CAMELOT method by deploying it in Langevin dynamics simulations of archetypal block-copolymeric sequences that are based on the exon 1 encoded intrinsically disordered region of the huntingtin protein.

FIG. 1. — Example (a) intrinsically disordered block-copolymeric and (b) partially disordered block-copolymeric sequences implicated in aggregation and phase separation. Sequences can be broken up into blocks consisting of ordered domains (circles) or intrinsically disordered regions with compositional biases (rounded rectangles). (a) Htt Exon 1 (UniProt ID: P42858) and Sup35p NM (Uniprot ID: P05453) are archetypal intrinsically disordered block-copolymeric sequences. Htt Exon 1 forms insoluble inclusions that are the pathological hallmarks of Huntington’s disease.⁷⁸ Sup35p NM forms insoluble amyloid fibrils and propagates as a prion.^128,129 (b) NCK1 (Uniprot ID: P16333) and Whi3p (Uniprot ID: Q75E28) are archetypal partially disordered block-copolymeric sequences. NCK1 is involved in the formation of liquid-like micron-sized membraneless organelles important for signal integration.¹³⁰ Whi3p is involved in the formation of functional assemblies important for branching and cell cycle control in filamentous fungi.¹⁸

There are two distinct paradigms for coarse-graining.³⁹ One is based on fixed resolution transferable models with consensus parameters.^40–43 The CAMELOT method follows a different paradigm whereby system-specific coarse-grained models are designed using information gleaned from prior finer-grained simulations. The general outline of the CAMELOT method is as follows: For the block-copolymeric sequence of interest, we first perform converged all atom simulations aided by enhanced sampling methods to obtain descriptions of conformational ensembles for individual molecules and oligomers. We use the information gleaned from the all atom simulations to prescribe the resolution, forcefield architecture, and parameters for the coarse-grained model. The questions we wish to answer and the length scales that we wish to access in our simulations will dictate the choice of resolution. The parameterization of the model of choice is based on a combination of Boltzmann inversion,^44,45 regression analysis, and a Gaussian process Bayesian optimization approach.⁴⁶ Our design of CAMELOT is guided by the successes of previous methods that include the force-matching algorithm of Voth and coworkers,^47–58 the Yvon-Born-Green formalism of Noid and coworkers,^59–61 and the relative entropy method of Shell and coworkers.^62–64

The remainder of the text is organized as follows: Section II describes the CAMELOT algorithm. Section III describes the numerical methods we use for all atom and coarse-grained simulations. In Section IV, we apply the CAMELOT method in coarse-grained simulations of archetypal block-copolymeric sequences. Our results help to establish the accuracy of the CAMELOT method. Section V demonstrates our ability to answer specific questions regarding the impact of N-terminal flanking sequences as modulators of the aggregation of polyglutamine tracts. Section VI discusses the physical insights that emerge regarding the polyglutamine containing systems and possible generalizations of the CAMELOT approach to studies of aggregation and phase behavior in other block-copolymeric systems.

II. CAMELOT METHOD

A. Choosing the resolutions for coarse-grained simulations

In coarse-grained simulations we wish to achieve the optimal balance of relevance, efficiency, and accuracy. Relevance refers to the fact that aggregation is a multimolecular phenomenon and requires the incorporation of O(10³) molecules in each simulation. However, such simulations need to be efficient while also being accurate. This tripartite balance among relevance, efficiency, and accuracy is not achievable using all atom representations of molecules and this in turn necessitates the use of coarse-grained models. However, the latter can become generic and inaccurate for the specific system in question if the model is not designed to achieve both efficiency and accuracy. Block-copolymeric systems are well suited for coarse-grained simulations because the degree of coupling within and between blocks is distinctly different. The block-copolymeric nature of IDRs emerges from compositional-to-conformational relationships of these sequences.^34,65 These relationships are governed by specific compositional biases such as the fraction of charged residues and the presence or absence of stretches of polar amino acids. Depending on the compositional biases, IDRs can adopt either globular or random-coil-like conformations. In contrast to IDRs, the ordered domains are essentially semi-rigid globules. Accordingly, we deploy two categories of coarse-grained beads whereby the beads mimic either individual residues or groups of residues (domains) that are modeled as colloidal beads.

A residue bead represents a single amino acid residue that is centered on its center-of-mass. In the CAMELOT model, there are three distinct groups of residue beads. The charged amino acids (Arg, Asp, Glu, Lys) belong to group-Ch, the strongly interacting residues (Asn, His, Ile, Leu, Gln, Met, Phe, Trp, Tyr, Val) belong to group-S, and the weakly interacting amino acids (Ala, Cys, Gly, Pro, Ser, Thr) belong to group-W. We derived these groupings from the amino acid solubility data of Auton and Bolen.⁶⁶ These data quantify the free energy changes associated with the transfer of an amino acid from an aqueous solution to a crystalline state. The solubility data are relevant because they capture the competition that characterizes the transfer of proteins from their dispersed states in aqueous solutions to aggregates that are rich in protein-protein as opposed to protein-solvent interactions.
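The three residue-bead groups listed above can be written down as a simple lookup. The following is a minimal sketch in Python; the names `BEAD_GROUPS`, `bead_group`, and `map_sequence` are hypothetical and are not part of any CAMELOT distribution.

```python
# Residue-bead groups exactly as listed in the text (three-letter codes).
# The dictionary and function names are illustrative assumptions.
BEAD_GROUPS = {
    "Ch": {"ARG", "ASP", "GLU", "LYS"},
    "S": {"ASN", "HIS", "ILE", "LEU", "GLN", "MET", "PHE", "TRP", "TYR", "VAL"},
    "W": {"ALA", "CYS", "GLY", "PRO", "SER", "THR"},
}


def bead_group(residue: str) -> str:
    """Return 'Ch', 'S', or 'W' for a three-letter amino acid code."""
    code = residue.upper()
    for group, members in BEAD_GROUPS.items():
        if code in members:
            return group
    raise ValueError(f"unknown residue: {residue}")


def map_sequence(residues):
    """Map a list of three-letter codes to their residue-bead groups."""
    return [bead_group(r) for r in residues]
```

For example, `map_sequence(["Lys", "Gln", "Ala"])` assigns one bead of each group.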

We use colloidal beads for groups of residues that correspond to ordered globular domains or IDRs that adopt globular conformations. We capture the effects of internal structures within colloidal beads using suitable inter-particle potentials that we glean from all atom simulations of pairwise associations of ordered domains and/or IDRs. The use of colloidal beads reduces the number of beads in the system enabling the inclusion of O(10³) to O(10⁴) molecules in coarse-grained simulations. This allows us to access length scales that are directly relevant to experimental studies of aggregation and phase separation of block-copolymeric sequences.

B. Effective energy functions for coarse-grained simulations

The effective energy functions for CAMELOT based coarse-grained simulations take the form

W_{eff} = W_{b} + W_{θ} + W_{ϕ} + W_{L J} + W_{e l} + W_{C} .

(1)

In Equation (1), the terms on the right hand side, respectively, correspond to bond length, bond angle, dihedral angle, Lennard-Jones, electrostatic, and colloidal bead interaction potentials. The choice of the resolution for coarse-grained simulations fixes the terms in Equation (1). If each residue is modeled as a single bead, then W_C = 0. Conversely, if every molecule is modeled as a single colloidal bead, then all terms except W_C become zero. All of the terms in Equation (1) are part of the effective energy function for hybrid resolutions that include residue specific beads and domain specific colloidal beads that correspond to groups of residues. The effective energy function is parameterized against data gathered from all atom simulations that are obtained at a specific temperature T. The specific parameters for W_eff are, therefore, valid for the particular temperature T at which the all atom simulations are performed. Each of the terms in Equation (1) is discussed in detail below.

1. Bonded interactions

Flexible bonds connect pairs of consecutive beads and there are preferred values for the lengths of these bonds and the angles that define the junctions at pairs of bonds. In terms of N_b and N_θ, the total number of bonds and bond angles in the system, the bond length and bond angle terms in Equation (1) are written as

\begin{matrix} W_{b} = \sum_{i = 1}^{N_{b}} \frac{K_{i} {(b_{i} - b_{0 i})}^{2}}{2}, \\ W_{θ} = \sum_{i = 1}^{N_{θ}} \frac{L_{i} {(θ_{i} - θ_{0 i})}^{2}}{2} . \end{matrix}

(2)

In Equation (2), K_i and L_i are the force constants that quantify the penalties associated with deforming bonds and bending bond angles beyond the corresponding equilibrium values of b_0i and θ_0i. The energies associated with rotations around bonds that connect four consecutive beads are defined in terms of the dihedral angle potential, which is a Fourier series of the form

W_{ϕ} = \sum_{i = 1}^{N_{ϕ}} \sum_{n = 1}^{3} V_{n i} [1 - cos (n ϕ_{i} - ϕ_{n i})] .

(3)

In Equation (3), N_ϕ refers to the number of dihedral angles, n is the number of terms in the Fourier series, V_ni is the amplitude, and ϕ_ni is the phase.
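As an illustration of Equations (2) and (3), the per-term bonded energies can be evaluated as follows. This is a sketch for a single bond, angle, or dihedral; the function names are assumptions, and angles are taken to be in radians.

```python
import math


def bond_energy(b, b0, k):
    """Single term of W_b in Equation (2): K (b - b0)^2 / 2."""
    return 0.5 * k * (b - b0) ** 2


def angle_energy(theta, theta0, l_const):
    """Single term of W_theta in Equation (2): L (theta - theta0)^2 / 2."""
    return 0.5 * l_const * (theta - theta0) ** 2


def dihedral_energy(phi, amplitudes, phases):
    """Single dihedral of Equation (3): sum over n = 1..3 of V_n [1 - cos(n phi - phi_n)]."""
    return sum(
        v * (1.0 - math.cos(n * phi - p))
        for n, (v, p) in enumerate(zip(amplitudes, phases), start=1)
    )
```

Summing these terms over all bonds, angles, and dihedrals gives W_b, W_θ, and W_ϕ.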

2. Lennard-Jones interactions

The W_LJ term, which is calculated over non-bonded pairs of beads, is written as

W_{L J} = \sum_{i = 1}^{N_{n b}} \sum_{j < i} \begin{cases} 4 ε_{i j} [{(\frac{σ_{i j}}{r_{i j}})}^{12} - {(\frac{σ_{i j}}{r_{i j}})}^{6}], & r_{i j} < r_{vdW}^{(i j)}, \\ 0, & r_{i j} \geq r_{vdW}^{(i j)} . \end{cases}

(4)

In Equation (4), N_nb is the number of interacting beads for the Lennard-Jones potential, r_ij is the distance between beads i and j, σ_ij is the distance at which the inter-bead potential is zero, and ε_ij is the strength of the interaction. The set of free parameters for W_LJ is chosen to be p_vdW ≡ [σ_ChCh, ε_SS, ε_SC₁, ε_SC₂, …, ε_{SC_N}, ε_WW, ε_WC₁, ε_WC₂, …, ε_{WC_N}]. Here, the subscript Ch refers to residue beads with a net charge, S refers to strongly interacting residue beads, W refers to weakly interacting residue beads, and C_N refers to colloidal beads of type N. All other parameters for the W_LJ term are prescribed either a priori or are determined by Lorentz-Berthelot mixing rules. The ε-values for the interactions between pairs of charged residue beads or between charged residue beads and colloidal beads are set to be 0.01 kcal/mol. The σ-value for beads corresponding to neutral residues is set to be equal to the average radii of gyration (R_g) values of these residues in the all atom simulations. The σ-values for interactions between any residue bead X and any colloidal bead C are set using the relation

σ_{X C} = \frac{R_{g} (C)}{{(2)}^{1 / 6}} .

(5)

In Equation (5), R_g(C) is the average R_g of the group of residues that corresponds to the colloidal bead, C. For each interaction pair ij, the cutoff is set using the relation $r_{vdW}^{(i j)} = 2.5 σ_{i j} .$
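The truncated Lennard-Jones term, the mixing rules, and Equation (5) can be sketched together. The code below assumes the standard Lorentz-Berthelot form (arithmetic mean for σ, geometric mean for ε); the function names are hypothetical.

```python
import math


def lorentz_berthelot(sigma_i, sigma_j, eps_i, eps_j):
    """Standard Lorentz-Berthelot mixing: arithmetic sigma, geometric epsilon."""
    return 0.5 * (sigma_i + sigma_j), math.sqrt(eps_i * eps_j)


def sigma_residue_colloid(rg_colloid):
    """Equation (5): sigma_XC = R_g(C) / 2^(1/6)."""
    return rg_colloid / 2.0 ** (1.0 / 6.0)


def lj_pair(r, sigma, eps):
    """Pair term of Equation (4), truncated at r_vdW = 2.5 sigma."""
    if r >= 2.5 * sigma:
        return 0.0
    sr6 = (sigma / r) ** 6
    return 4.0 * eps * (sr6 * sr6 - sr6)
```

One consequence of Equation (5) is that the Lennard-Jones minimum, which sits at 2^(1/6) σ, falls at a separation equal to R_g(C) for residue-colloid pairs.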

3. Electrostatic interactions

Residue beads that correspond to Lys, Arg, Glu, or Asp will have a net charge of ±1. The net charge of a colloidal bead is the sum over all charged amino acids within the colloid. The interactions, W_el, between beads with excess charge are written in terms of a Yukawa potential,

W_{e l} = \sum_{i = 1}^{n_{c}} \sum_{j < i} \begin{cases} \frac{C q_{i} q_{j}}{ε r_{i j}} exp (- \frac{r_{i j}}{l_{D}}), & r_{i j} < r_{c}, \\ 0, & r_{i j} \geq r_{c} . \end{cases}

(6)

Here, n_c denotes the number of beads that have an excess charge, q_i is the charge on bead i, ε = 78 is the dielectric constant, l_D = 10 Å is the Debye length for physiologically relevant ionic strengths, and r_c is the distance past which the mean-field electrostatic interactions are zeroed out.
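A pair term of Equation (6) can be sketched as below. The text does not give numerical values for the prefactor C or the cutoff r_c. The Coulomb constant used here (about 332.06 kcal Å mol⁻¹ e⁻²) is an assumption appropriate for energies in kcal/mol and distances in Å, and the cutoff is left as a required argument.

```python
import math

# Assumption: Coulomb prefactor in kcal * Angstrom / (mol * e^2); the text only names it C.
COULOMB = 332.06


def yukawa_pair(r, q_i, q_j, r_cut, dielectric=78.0, debye_length=10.0):
    """Pair term of Equation (6): screened Coulomb interaction, zero beyond r_cut.

    Distances are in Angstrom; dielectric = 78 and debye_length = 10 Angstrom follow the text.
    """
    if r >= r_cut:
        return 0.0
    return COULOMB * q_i * q_j / (dielectric * r) * math.exp(-r / debye_length)
```

With these defaults, two unit charges of opposite sign attract at all separations below the cutoff, and the interaction decays by a factor of e over each Debye length in addition to the 1/r decay.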

4. Interactions between colloidal beads

The coarse-grained model for a block-copolymeric sequence may have one or more colloidal beads. These beads are either identical to or different from one another. The number and types of colloidal beads are specified by the sequence encoded conformational properties of the block-copolymeric sequence of interest. The collective interactions among colloidal beads are modeled using the W_C term in Equation (1), where $W_{C} = \sum_{i = 1}^{n_{t}} \sum_{k = 1}^{g_{i}} \sum_{j = i + 1}^{n_{t}} \sum_{l = 1}^{g_{j}} w_{C} (r_{k}^{(i)}, r_{l}^{(j)})$. Here, n_t is the number of distinct types of colloidal beads, g_i is the number of colloidal beads of type i, and $r_{k}^{(i)}$ is the position vector of bead k of type i. The functional form for w_C is not defined a priori but is determined using data gathered from simulations of pairs of colloidal beads represented in all atom detail. An example of the implementation of our approach for defining w_C is described in Subsection IV B for the sequence of an archetypal block-copolymer.

C. Parameterization of uncoupled W_eff terms: W_b, W_θ, W_ϕ, and w_C

We use a Boltzmann inversion procedure^44,45 to extract parameters for uncoupled terms within W_eff. For a given observable, such as the bond length x, the observed probability distribution from all atom simulations is written as ρ(x) ∝ exp[ − W(x)/k_BT]. Here, ρ(x) is the probability density associated with the observable x, W(x) is the effective potential in terms of the observable, k_B is the Boltzmann constant, and T is the simulation temperature.

Since W_b and W_θ are modeled using harmonic potentials, inversion of the Boltzmann relationship yields the analytical relationships for the parameters of the potentials in terms of the first and second moments, viz., $〈b_{i}〉$ , $〈θ_{i}〉$ , $〈b_{i}^{2}〉$ , and $〈θ_{i}^{2}〉$ of the relevant probability distributions,

\begin{matrix} b_{0 i} = 〈b_{i}〉, K_{i} = \frac{k_{B} T}{〈b_{i}^{2}〉 - {〈b_{i}〉}^{2}}, \\ θ_{0 i} = 〈θ_{i}〉, L_{i} = \frac{k_{B} T}{〈θ_{i}^{2}〉 - {〈θ_{i}〉}^{2}} . \end{matrix}

(7)
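Equation (7) amounts to reading the equilibrium value and force constant off the mean and variance of the sampled coordinate. A minimal sketch follows, assuming energies in kcal/mol so that k_B ≈ 0.0019872 kcal mol⁻¹ K⁻¹; the function name is hypothetical.

```python
import statistics

K_B = 0.0019872041  # Boltzmann constant in kcal / (mol K)


def harmonic_from_samples(samples, temperature):
    """Equation (7): return (equilibrium value, force constant) from sampled values.

    The force constant is k_B T divided by the population variance <x^2> - <x>^2.
    """
    mean = statistics.fmean(samples)
    variance = statistics.pvariance(samples, mu=mean)
    return mean, K_B * temperature / variance
```

The same function applies to bond lengths and bond angles; narrower all atom distributions give stiffer coarse-grained springs.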

Parameters for W_ϕ and w_C are obtained via non-linear regression analysis by fitting W_ϕ and w_C to the equation

W (x) = - k_{B} T ln [P (x)] .

(8)

In Equation (8), P(x) is either the probability distribution for each dihedral angle for W_ϕ or the pair correlation function between centers-of-mass of grouped residues that correspond to a given pair of colloidal beads for w_C.
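To illustrate Equation (8) for a dihedral angle, the sketch below inverts a histogram and then recovers the amplitudes and phases of Equation (3). Note that it swaps the non-linear regression described in the text for a discrete Fourier projection. The two agree only under the assumptions made here: uniform bins spanning the full period, every bin populated, and a constant offset in W that is ignored. Function names are hypothetical.

```python
import math

K_B = 0.0019872041  # Boltzmann constant in kcal / (mol K)


def boltzmann_invert(prob, temperature):
    """Equation (8): W = -k_B T ln P for each (populated) bin."""
    return [-K_B * temperature * math.log(p) for p in prob]


def fit_dihedral(phi_grid, w_values, n_terms=3):
    """Recover (V_n, phi_n) for n = 1..n_terms from W sampled on a uniform grid.

    Uses V_n [1 - cos(n phi - phi_n)] = V_n + a_n cos(n phi) + b_n sin(n phi)
    with a_n = -V_n cos(phi_n) and b_n = -V_n sin(phi_n).
    """
    m = len(phi_grid)
    params = []
    for n in range(1, n_terms + 1):
        a = 2.0 / m * sum(w * math.cos(n * phi) for phi, w in zip(phi_grid, w_values))
        b = 2.0 / m * sum(w * math.sin(n * phi) for phi, w in zip(phi_grid, w_values))
        params.append((math.hypot(a, b), math.atan2(-b, -a)))
    return params
```

For noisy or sparsely populated histograms, or for the colloidal pair potential w_C whose functional form is not a Fourier series, a general non-linear least-squares fit would be the appropriate tool.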

D. Parameterization of W_LJ

1. Primary objective function

As described in Subsection II B 2, the set of free parameters for W_LJ is chosen to be p_vdW ≡ [σ_ChCh, ε_SS, ε_SC₁, ε_SC₂, …, ε_{SC_N}, ε_WW, ε_WC₁, ε_WC₂, …, ε_{WC_N}]. Given this set of free parameters, we seek values for p_vdW that minimize the objective function Ω(p_vdW), which is defined as

Ω (p_{vdW}) = 1 - \frac{1}{m} \sum_{i = 1}^{m} (\frac{2 - \sum_{j = 1}^{n_{bin}} |ρ_{i j}^{AA} - ρ_{i j}^{CG} (p_{vdW})|}{2}) .

(9)

Here, m is the number of distinct inter-bead distances in the coarse-grained model that correspond to pairs of beads that are separated by at least five bonds. Data for the histograms that quantify the probability densities for distances between pairs of interacting beads are recorded in simulations based on the all atom (AA) and coarse-grained (CG) models. For a pair of beads designated by a single index i, the densities within each of the bins j of the corresponding histograms are denoted as $ρ_{i j}^{AA}$ and $ρ_{i j}^{CG}$ . The optimal parameters for p_vdW should minimize Ω(p_vdW) and hence maximize the collective overlap among m pairs of histograms where m = (n − 5)(n − 4)/2. Here, n denotes the number of beads in the coarse-grained model.
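Equation (9) and the pair count can be checked with a short sketch. It assumes that each histogram is normalized so that its bins sum to one, which is what makes the maximum possible absolute difference equal to 2; the function names are hypothetical.

```python
def overlap_objective(hists_aa, hists_cg):
    """Equation (9): one minus the mean overlap between AA and CG histograms.

    Each histogram is a list of bin densities that sums to 1 (assumption).
    """
    m = len(hists_aa)
    total_overlap = 0.0
    for rho_aa, rho_cg in zip(hists_aa, hists_cg):
        l1 = sum(abs(a - c) for a, c in zip(rho_aa, rho_cg))
        total_overlap += (2.0 - l1) / 2.0
    return 1.0 - total_overlap / m


def n_pairs(n_beads):
    """Number of bead pairs separated by at least five bonds: (n - 5)(n - 4) / 2."""
    return (n_beads - 5) * (n_beads - 4) // 2
```

Identical histograms give Ω = 0 and histograms with no shared bins give Ω = 1, so Ω can be read as an average fractional mismatch.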

2. Auxiliary objective function

The design of the primary objective function assumes that the overlap between probability densities of distances between all atom and coarse-grained models should be weighted equally for all pairs of beads. However, depending on the question of interest, the overlap of distributions for certain pairs of beads may be more important than others. In addition, it might be important to include observables that go beyond distance distributions. In these situations, an auxiliary objective function should be used in a post-processing step to refine some or all of the p_vdW parameters.

3. Gaussian process Bayesian optimization procedure for obtaining p_vdW

Minimization of Ω(p_vdW) is non-trivial for two reasons. First, the objective function does not have an analytical form; therefore, we cannot readily use gradient-based methods for optimization of Ω(p_vdW) to predict the optimal values of p_vdW. Second, the objective function is expensive to evaluate because the calculation of Ω(p_vdW) requires that we perform Langevin dynamics (LD) simulations of individual molecules in the coarse-grained representation to calculate $ρ_{i j}^{CG}$. Bayesian optimization methods are well suited for such problems. These methods are efficient because they use prior knowledge that is generated during the optimization process to direct the sampling in high-dimensional spaces. Bayesian methods strike an optimal balance between exploration (sampling the parameter space, albeit with high uncertainty regarding the objective function) and exploitation (using prior knowledge and sampling where the objective function is likely to be minimized).

Our primary goal is to identify values of p_vdW that minimize the primary objective function Ω. In order to minimize Ω, we use a Bayesian optimization procedure. This requires the collection of observations O_1:t = [(p_vdW,1:t, Ω_1:t)]. Here, O_1:t denotes the set of consecutive observations [O₁, …, O_t] from sampling the multidimensional space of parameters. The accumulation of observations is used to generate the likelihood function, P(O_1:t|Ω). The likelihood function is combined with the prior distribution P(Ω) to generate the posterior distribution, P(Ω|O_1:t), for the unknown objective function Ω. The posterior probability for obtaining the desired low value of Ω given a collection of observations O_1:t is calculated using Bayes' theorem,

P (Ω | O_{1 : t}) \propto P (O_{1 : t} | Ω) P (Ω) .

(10)

In order to use Bayesian optimization, we need a model to estimate Ω. Here, we model Ω using a Gaussian process, which is a distribution of objective functions that is defined by its mean and covariance functions. A Gaussian process predicts the most likely values for objective functions and the uncertainties in the estimates for the most likely values. Each Gaussian process Bayesian optimization (GPBO) iteration involves three steps: (i) The algorithm utilizes the posterior distribution and the expected improvement acquisition function to determine the next set of p_vdW values. (ii) Given a choice for p_vdW, we perform LD simulations for a single chain based on the coarse-grained model. (iii) Data from these simulations are used to calculate Ω(p_vdW) and the posterior distribution is updated based on the Gaussian process.⁶⁷
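The three steps of a GPBO iteration can be sketched on a toy problem. The code below is not the implementation used in this work. It assumes a one-dimensional parameter, a squared-exponential kernel with a fixed length scale, a constant prior mean, and a grid search over the acquisition function, and it replaces the LD simulation of step (ii) with an arbitrary callable `objective`. All names are hypothetical.

```python
import math
import random


def rbf(a, b, length=0.3):
    """Squared-exponential covariance between two scalar parameter values."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)


def solve(mat, rhs):
    """Solve mat x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(mat)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (aug[i][n] - sum(aug[i][j] * x[j] for j in range(i + 1, n))) / aug[i][i]
    return x


def gp_posterior(xs, ys, x_new, noise=1e-6):
    """Posterior mean and standard deviation of the GP at x_new."""
    k_mat = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
             for i, a in enumerate(xs)]
    k_vec = [rbf(a, x_new) for a in xs]
    mean_y = sum(ys) / len(ys)
    alpha = solve(k_mat, [y - mean_y for y in ys])
    beta = solve(k_mat, k_vec)
    mu = mean_y + sum(k * a for k, a in zip(k_vec, alpha))
    var = max(rbf(x_new, x_new) - sum(k * b for k, b in zip(k_vec, beta)), 1e-12)
    return mu, math.sqrt(var)


def expected_improvement(mu, sd, best):
    """Expected improvement for minimization, given the best value seen so far."""
    z = (best - mu) / sd
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return (best - mu) * cdf + sd * pdf


def gpbo_minimize(objective, lo, hi, n_init=4, n_iter=10, seed=0):
    """Toy GPBO loop: (i) pick next point by EI, (ii) evaluate, (iii) update data."""
    rng = random.Random(seed)
    xs = [rng.uniform(lo, hi) for _ in range(n_init)]
    ys = [objective(x) for x in xs]
    grid = [lo + (hi - lo) * i / 50 for i in range(51)]
    for _ in range(n_iter):
        best = min(ys)
        x_next = max(grid, key=lambda x: expected_improvement(*gp_posterior(xs, ys, x), best))
        xs.append(x_next)           # step (i): parameters proposed by the acquisition function
        ys.append(objective(x_next))  # steps (ii)-(iii): evaluate and extend the observations
    i_best = min(range(len(ys)), key=ys.__getitem__)
    return xs[i_best], ys[i_best]
```

In the setting of this section, `objective` would wrap a coarse-grained LD simulation followed by the evaluation of Ω(p_vdW), and the scalar parameter would be replaced by the full parameter vector.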

Depending on the complexity of the system, the GPBO procedure can be run in parallel with independent trials to ensure convergence to the same parameter space and/or over generations in which each subsequent generation utilizes information from the previous generation to shrink the parameter search space. An example of such a hierarchical approach is described for an archetypal system in Subsection IV D 3. In general, a single generation, within which hundreds of GPBO iterations are conducted, appears to be reasonable for most systems containing fewer than seven parameters.

III. SIMULATION SETUP

A. All atom simulations

All atom simulations were performed using the CAMPARI modeling package (http://campari.sourceforge.net) utilizing the ABSINTH implicit solvation model^68–71 and forcefield paradigm. This model is accurate, as measured against experimental data, and efficient for all atom simulations of conformational properties and intermolecular associations of intrinsically disordered proteins.⁷² In this work, the all atom simulations were based on the abs3.2_opls.prm parameter set. Additional details of the all atom simulations are provided in the supplementary material.⁷³

B. Coarse-grained Langevin dynamics simulations

We present the details of the LD simulations in Cartesian space that are used to guide the parameterization of the W_LJ terms and for simulations that use the optimized coarse-grained model. All LD simulations were performed using the LAMMPS simulation package (http://lammps.sandia.gov). In these simulations, the Langevin equation shown in Equation (11) is integrated to propagate the positions and velocities of the residue specific and/or domain specific colloidal beads. The force on each of the beads labeled i is written as

F_{i} = - \nabla W_{eff, i} - \frac{m_{i}}{γ_{i}} v_{i} + R_{i} .

(11)

In Equation (11), F_i is the force exerted on bead i; it is a sum of the negative gradient of the effective interactions specified by W_eff, the frictional force proportional to the velocity of the bead i, v_i, and the random force R_i exerted by collisions of the beads with the bath. The random forces have a white noise spectrum in accord with the fluctuation dissipation theorem. The equation of motion is integrated using a velocity Verlet algorithm with an integration time step of 2 fs. The damping term is written as $γ_{i} = α γ_{i}^{'}$. Here, α = 2 ps is a scaling factor and $γ_{i}^{'} = (\frac{m_{i}}{6 π η R_{g}^{(i)}}) γ_{LYS}^{- 1}$, m_i is the mass of bead i, η = 6.29 × 10⁻⁴ kg m⁻¹ s⁻¹ is the viscosity of water at 315 K, $R_{g}^{(i)}$ is the average radius of gyration of bead i as calculated from data based on all atom simulations, and $γ_{LYS}^{- 1}$ is the inverse of the damping parameter for the lysine bead that is calculated using the Stokes-Einstein relationship. LD simulations of individual coarse-grained molecules were initiated by mapping equilibrated conformations drawn from the all atom simulations to the coarse-grained model.
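The damping parameters can be sketched as follows. This reading assumes that the lysine damping parameter takes the same Stokes form, m/(6πηR_g), as the bead term, so that γ'_i is a dimensionless ratio and the lysine bead receives a damping time of exactly α = 2 ps. The masses and radii passed in are placeholders to be taken from the all atom data; the function names are hypothetical.

```python
import math

ETA = 6.29e-4   # viscosity of water at 315 K in kg m^-1 s^-1 (from the text)
ALPHA_PS = 2.0  # scaling factor alpha in ps (from the text)


def stokes_time(mass, rg):
    """Stokes damping time m / (6 pi eta R_g), in units set by the inputs."""
    return mass / (6.0 * math.pi * ETA * rg)


def langevin_damping(mass, rg, mass_lys, rg_lys):
    """gamma_i = alpha * gamma'_i, with gamma'_i taken relative to the lysine bead.

    Assumption: gamma_LYS uses the same Stokes form, so eta and all unit
    conversions cancel in the ratio, provided both beads use the same units.
    """
    gamma_prime = stokes_time(mass, rg) / stokes_time(mass_lys, rg_lys)
    return ALPHA_PS * gamma_prime
```

Under this assumption, heavier beads at fixed R_g are damped over proportionally longer times, and larger beads at fixed mass over shorter times.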

For multi-chain simulations at finite concentrations, there are at least O(10³) coarse-grained molecules in each simulation cell. These simulations were performed in the canonical ensemble at constant concentrations, which are controlled using periodic boundary conditions. We initiate the simulations by drawing an initial conformation at random from the equilibrated ensembles of all atom simulations and replicating it on three-dimensional lattices. This is followed by energy minimization using steepest descent and a subsequent equilibration based on 10⁶ steps of LD simulations performed with time steps of 2 fs. Each final simulation involves 10⁸ integration steps and for every combination of sequence and peptide concentration we performed multiple independent simulations.

IV. APPLICATION OF CAMELOT FOR SIMULATIONS OF BLOCK-COPOLYMERS WITH POLYGLUTAMINE TRACTS

Several proteins with polyglutamine tracts or glutamine-rich regions have been identified as drivers of aggregation and phase separation.^74–76 The translation of genes with expanded CAG trinucleotide repeats leads to proteins with expanded polyglutamine (polyQ) tracts.⁷⁷ These form insoluble inclusions that are the pathological hallmarks of several neurodegenerative diseases including Huntington’s disease.⁷⁸ Water is a poor solvent for homopolymeric polyQ tracts as well as polypeptide backbones.^79,80 Accordingly, individual polyQ molecules form globular structures to minimize the chain-solvent interface. For polyQ molecules of a particular chain length, there exists a well-defined saturation concentration (c_s) that defines the boundary between soluble and insoluble phases.⁸¹ The insoluble phase is enriched in fibrillar aggregates and the measured values of c_s decrease with increasing polyQ length. Additionally, for concentrations below c_s, there exists a second saturation concentration (c_c), which corresponds to the formation of spherical aggregates. Kinetics experiments initiated from fully disaggregated solutions that are supersaturated with respect to c_s indicate the early formation of spherical aggregates, 10-30 nm in size, that are precursors of fibrillar aggregates. Fibril formation is barrier-limited and appears to proceed via nucleated conformational conversion within liquid-like spheres^82–84 in accord with the mechanism for crystallization that was proposed by ten Wolde and Frenkel.⁸⁵ In contrast, the formation of metastable liquid-like spheres does not involve discernible free energy barriers.⁸¹

Flanking sequence modules modulate the driving forces for and mechanisms of polyQ aggregation.^76,81,86–93 This has been observed for the N-terminal stretch of huntingtin. For a given polyQ tract, the presence of N17, the 17-residue N-terminal flanking sequence module, leads to a lowering of c_s vis-à-vis the values measured for polyQ tracts. N17 narrows the gap between c_s and c_c and decreases the metastability of spherical aggregates. Accordingly, N17 helps accelerate the formation of fibrillar aggregates.^94,95 For a given polyQ length, fibril formation proceeds without a discernible lag time if N17 is appended N-terminally to the polyQ tract. The curious effects of N17 are attributable to a domain cross talk between N17 and polyQ.^96,97 According to this model, the N17 module adsorbs and unfolds on the surface of the polyQ domain. This engenders a patchy colloid^98–108 architecture for the N17-polyQ block-copolymer. In direct analogy with the physics of patchy colloids, the presence of an adsorbed N17 patch, with charged groups exposed on the surface, leads to a diminution of non-specific interactions of polyQ molecules.^96,97 The patch on the colloidal particle breaks the spherical symmetry and imparts directional preferences to intermolecular encounters, thus promoting a distinct preference for linear aggregates. Here, we deploy our CAMELOT approach to test the applicability of the patchy colloid model for explaining the N17 enhanced formation of linear aggregates in block-copolymer sequences with polyQ tracts.

A. Determining the coarse-grained resolution from all atom simulations

1. Sequences of interest

For simplicity, the N-terminal and polyQ blocks are denoted as N- and Q-blocks, respectively (see Figure 2). We used sequence design to generate different types of N-block sequences. The different N-block sequences (see Figure 2) can be distinguished by the degree of adsorption between the N- and Q-blocks as shown in Figure 3. Here, we use data from all atom simulations to organize the different N-Q sequences along the ordinate of Figure 3 in ascending order of the degree of adsorption (d_A) between different N-blocks and a Q₄₀-block. The parameter d_A is calculated as

d_{A} = \frac{V_{I}}{V_{N} + V_{Q}}, \qquad V_{I} = \begin{cases} \frac{π {(R_{g, N} + R_{g, Q} - r_{c})}^{2} (r_{c}^{2} + 2 r_{c} R_{g, Q} - 3 R_{g, Q}^{2} + 2 r_{c} R_{g, N} + 6 R_{g, N} R_{g, Q} - 3 R_{g, N}^{2})}{12 r_{c}}, & \text{if } R_{g, N} + R_{g, Q} > r_{c}, \\ 0, & \text{if } R_{g, N} + R_{g, Q} \leq r_{c} . \end{cases}

(12)

In Equation (12), V_I is the volume of the intersection between spherical envelopes corresponding to the N- and Q-blocks. The terms V_N and V_Q in the denominator were calculated using the conformation-specific R_g values for the N- and Q-blocks. In the definition of V_I, r_c is the distance between the centers-of-mass of the N- and Q-blocks whereas R_g,N and R_g,Q are the conformation-specific radii of gyration calculated over the atoms of the N- and Q-blocks, respectively. When d_A = 0, the spherical envelopes of the N- and Q-blocks do not intersect. The maximal degree of adsorption between a 17-residue N-terminal stretch and a globular polyQ domain is ∼0.4.
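Equation (12) can be evaluated per conformation with a few lines of code. The sketch below assumes that V_N and V_Q are the volumes of spheres of radius R_g,N and R_g,Q, which is our reading of "spherical envelopes"; the function names are hypothetical.

```python
import math


def sphere_volume(radius):
    """Volume of a sphere; used here for V_N and V_Q (assumption: radius = R_g)."""
    return 4.0 / 3.0 * math.pi * radius ** 3


def intersection_volume(rg_n, rg_q, r_c):
    """V_I of Equation (12): lens volume of two spheres whose centers are r_c apart."""
    if rg_n + rg_q <= r_c:
        return 0.0
    poly = (r_c ** 2 + 2 * r_c * rg_q - 3 * rg_q ** 2
            + 2 * r_c * rg_n + 6 * rg_n * rg_q - 3 * rg_n ** 2)
    return math.pi * (rg_n + rg_q - r_c) ** 2 * poly / (12.0 * r_c)


def degree_of_adsorption(rg_n, rg_q, r_c):
    """d_A of Equation (12) for a single conformation."""
    return intersection_volume(rg_n, rg_q, r_c) / (sphere_volume(rg_n) + sphere_volume(rg_q))
```

For two unit spheres whose centers are one radius apart, the lens volume is 5π/12 and d_A = 5/32. The lens expression applies to partially overlapping spheres; it is not meaningful when one sphere lies entirely inside the other or when r_c approaches zero.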

FIG. 2. — N-Q sequences used for this study. N-block denotes different 17-residue N-terminal sequences and Q-block denotes a polyQ tract with 40 Gln residues. N17 denotes the wild type sequence (UniProt ID: P42858). Sequences of the N-block were designed in order to modulate the amino acid sequence (N17_W and N17_S) or to modulate the amino acid composition (E(KE)₈). Designs that modulate amino acid sequence maintain the wild type N17 composition but scramble the sequence. The choice of the polyampholytic sequence E(KE)₈ allowed for examination of N-block properties that could not be accessed using the N17 composition. In the sequences, hydrophobic residues are in black, polar residues are in green, and positively and negatively charged residues are shown in blue and red, respectively.

FIG. 3. — Degree of adsorption (d_A) to Q₄₀ versus the probability of forming N-block dimers for each N sequence. Results were extracted from at least 3 independent all atom simulations of monomers and monomer-dimer equilibria simulations, respectively. The dotted orange line quantifies the probability of Q₄₀ forming dimers. Dimers are defined by any two residues of differing molecules that are less than or equal to 3.5 Å apart. Insets show representative snapshots for each N-Q sequence. The black and orange translucent spheres correspond to the R_g’s of the N- and Q-blocks, respectively. The overlap between translucent spheres serves as a visualization of d_A.

The values for d_A were calculated from ABSINTH-based all atom simulations of monomeric variants of different N-Q sequences. The abscissa in Figure 3 corresponds to the probabilities of forming dimers of N-blocks as autonomous units. These results were also extracted from ABSINTH-based all atom simulations with pairs of N-block molecules. The results summarized in Figure 3 make several points: The N-blocks show negligibly low likelihoods for self-interactions when compared to the high probability of self-associations between pairs of Q₄₀ molecules. By scrambling the sequence of N17, we were able to design sequences that adsorb more strongly (N17_S) and more weakly (N17_W) to the polyQ domain when compared to the sequence of N17 that is drawn from the N-terminus of the huntingtin protein. Since a value of d_A = 0 cannot be achieved with the composition of N17, we designed a synthetic polyampholytic sequence, Glu-(Lys-Glu)₈ denoted as E(KE)₈, that helps us to achieve zero adsorption between the N- and Q-blocks. The four N-Q sequences allow us to titrate the effects of varying d_A on the sequence-encoded bias toward linear aggregates.

2. Choice of resolution

To test the hypothesis that the degree of adsorption between N- and Q-blocks modulates the bias towards linear aggregates, we need to understand the interplay between sequence-specific properties of flanking sequences and polyQ-mediated aggregation. Specifically, we seek a model that maintains sequence-specific properties of the flanking sequences while enabling simulations of O(10³) molecules in order to study the early stages of polyQ-mediated aggregation.

Experimental data and atomistic simulation results show that polyQ constructs adopt globular conformations in aqueous solutions.⁷⁹ Previous simulations have shown that R_g = R₀N^ν for polyQ tracts. Here, R_g refers to the ensemble-averaged radius of gyration, N is the number of residues in the tract, ν = 0.33 for globules, and the pre-factor is R₀ = 3.0 Å. We analyzed the results from all atom simulations at T = 315 K to establish that the Q-block adopts globular conformations in the context of N-Q sequences. Further, in all of the N-Q constructs, the values of R_g calculated over the Q-blocks are concordant with a scaling exponent of ν = 0.33.
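As a quick numerical illustration of the scaling relation, the sketch below evaluates R_g = R₀N^ν with the values quoted above (R₀ = 3.0 Å, ν = 0.33). The function name is hypothetical, and the result is only a scaling estimate, not the simulation-derived R_g used elsewhere in the text.

```python
def globule_rg(n_residues, r0=3.0, nu=0.33):
    """Scaling estimate of the radius of gyration (angstroms) for a
    globular chain of n_residues, R_g = R0 * N**nu."""
    return r0 * n_residues ** nu

# Estimate for a polyQ tract of 40 residues (Q40); roughly 10 angstroms
rg_q40 = globule_rg(40)
print(f"Scaling estimate of R_g for Q40: {rg_q40:.1f} angstroms")
```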

Since the Q-block maintains its preference for globular conformations in all N-Q sequences examined here, we used the following architecture for the coarse-grained model: the residues of the N-block are modeled as residue beads whereas the residues of the Q-block are lumped together as a single colloidal bead (Figure 4). This choice allows for sequence-specific properties of the flanking sequences to be maintained while enabling significant computational efficiency in simulations of O(10³) N-Q molecules.

FIG. 4. — Architecture of the coarse-grained model compared to the all atom model for N17-Q₄₀. For the coarse-grained model, each residue in the N-block is modeled as a single bead, whereas all the residues of Q-block are modeled as a single colloidal bead. Here, Gln residues are in orange, hydrophobic residues are in black, positively and negatively charged residues are in blue and red, respectively, and other polar residues are in green.

In order to justify our choice of resolution, we deployed a network-based approach to provide an unbiased assessment of groups of residues that interact preferentially among themselves as opposed to interacting across groups. The idea is that residues that prefer to interact among themselves in the all atom simulations can be treated as a single entity upon coarse-graining. In network-based approaches, such groups are referred to as communities.¹⁰⁹,¹¹⁰ The overall strategy, which is adapted from the work of Sethi et al.,¹¹¹ is as follows: each conformation drawn from the ensemble of conformations generated in an atomistic simulation is converted to a network of nodes and edges. The nodes are positions of C_α atoms and edges are drawn between pairs of nodes that are within 13 Å of each other. Communities based on these networks are determined using the Girvan-Newman algorithm combined with the network modularity score, Q. The final community structure for each conformation (network) is taken as the sub-network that yielded the highest Q-score.⁷³
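Two ingredients of this strategy, building the contact network from C_α positions and scoring a candidate partition with the modularity Q, can be sketched as follows. This is a minimal illustration under stated assumptions: edges are treated as unweighted, the Girvan-Newman search over partitions (iterative removal of high-betweenness edges) is omitted, and the coordinates are a hypothetical one-dimensional toy system rather than data from the simulations.

```python
import math

def contact_edges(ca_coords, cutoff=13.0):
    """Pairs (i, j) with i < j whose C-alpha positions lie within
    `cutoff` angstroms of each other."""
    edges = []
    for i in range(len(ca_coords)):
        for j in range(i + 1, len(ca_coords)):
            if math.dist(ca_coords[i], ca_coords[j]) <= cutoff:
                edges.append((i, j))
    return edges

def modularity(n_nodes, edges, community_of):
    """Newman modularity Q of an unweighted network for a given
    assignment of nodes to communities."""
    m = len(edges)
    if m == 0:
        return 0.0
    degree = [0] * n_nodes
    for i, j in edges:
        degree[i] += 1
        degree[j] += 1
    degree_sum = {}   # total degree of the nodes in each community
    intra_edges = {}  # number of edges inside each community
    for node in range(n_nodes):
        c = community_of[node]
        degree_sum[c] = degree_sum.get(c, 0) + degree[node]
    for i, j in edges:
        if community_of[i] == community_of[j]:
            c = community_of[i]
            intra_edges[c] = intra_edges.get(c, 0) + 1
    return sum(intra_edges.get(c, 0) / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)

# Hypothetical toy "conformation": two groups of three nodes on a line
coords = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0),
          (22.0, 0.0, 0.0), (27.0, 0.0, 0.0), (32.0, 0.0, 0.0)]
edges = contact_edges(coords)
q_split = modularity(6, edges, [0, 0, 0, 1, 1, 1])  # two communities
q_single = modularity(6, edges, [0, 0, 0, 0, 0, 0])  # one community
```

In the toy system the two-community partition scores higher than the single-community partition, which is the sense in which the highest-Q partition identifies groups of preferentially interacting residues.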

Panels (a)–(d) in Figure 5 summarize the results obtained from analysis of the community structures for all N-Q constructs. Specifically, each panel plots the probability that two residues are observed to be in the same community. Hotter colors imply that two residues have a higher probability of being part of the same community, whereas cooler colors imply that two residues have a low probability of being part of the same community. For E(KE)₈-Q₄₀, N17_W-Q₄₀, and N17-Q₄₀, the community analysis clearly shows that distinct communities involve residues within the N- and Q-blocks, respectively. Figure 6 provides further quantitative evidence in support of the choice of the coarse-grained model. Here, we quantify the probability that a residue from block X (N- or Q-) belongs to the same community as a residue from block Y (N- or Q-). Even though the coupling increases between the N- and Q-blocks for sequences such as N17_S-Q₄₀, residues prefer to be in communities with other residues that are part of the same block. Given the low probabilities for N- and Q-block residues to be part of the same community and the lack of evidence for sub-communities within the Q-block, the coarse-grained resolution in which each N-block residue is modeled as a residue bead and the Q-block is modeled as a colloidal bead is justified for the N-Q constructs.
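The two quantities discussed here, the pairwise probability of sharing a community and its block-level average, can be sketched as follows. The input format (one list of community labels per conformation) and the exclusion of self-pairs from the block averages are assumptions made for illustration. The four-residue example is hypothetical.

```python
def same_community_probability(assignments):
    """assignments[k][i] is the community label of residue i in
    conformation k. Returns p[i][j], the fraction of conformations in
    which residues i and j share a community."""
    n = len(assignments[0])
    counts = [[0] * n for _ in range(n)]
    for labels in assignments:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    total = len(assignments)
    return [[c / total for c in row] for row in counts]

def block_average(prob, block_x, block_y):
    """Mean of p[i][j] over i in block X and j in block Y,
    excluding self-pairs (an assumption of this sketch)."""
    values = [prob[i][j] for i in block_x for j in block_y if i != j]
    return sum(values) / len(values)

# Hypothetical example: residues 0-1 form the N-block, 2-3 the Q-block
assignments = [[0, 0, 1, 1],
               [0, 0, 0, 1]]
n_block, q_block = [0, 1], [2, 3]
p = same_community_probability(assignments)
intra_n = block_average(p, n_block, n_block)
intra_q = block_average(p, q_block, q_block)
inter_nq = block_average(p, n_block, q_block)
```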

FIG. 5. — Community analysis for each of the N-Q sequences. Conformations from all atom simulations were converted into networks wherein each C_α position was considered a node and a weighted edge *e_w* = *d_ij* was drawn between two nodes if the distance between the C_α positions of residues i and j, *d_ij*, was less than or equal to 13 Å. Communities based on these networks were determined using the Girvan-Newman algorithm combined with the network modularity score, Q. The final community structure for each conformation (network) was taken as the sub-network that yielded the highest Q. Panels (a)-(d) show the probability that any two residues are in the same community for E(KE)₈-Q₄₀, N17_W-Q₄₀, N17-Q₄₀, and N17_S-Q₄₀, respectively. The bottom left corner of each plot corresponds to probabilities of intra-N communities (residue numbers 1-17), whereas the top right corner of each plot corresponds to the probabilities of intra-Q communities (residue numbers 18-57). For E(KE)₈-Q₄₀, N17_W-Q₄₀, and N17-Q₄₀, intra-block communities are clearly more favorable than inter-block communities (hotter colors for intra-N and intra-Q pairs versus white or blue colors for inter-block pairs). Additionally, within the N- and Q-blocks there are no well-defined sub-communities.

FIG. 6. — Average probabilities that a residue from block X is in the same community as a residue from block Y. Here, block X and block Y can refer to either N- or Q-blocks. The first two columns show the average intra-block probabilities for each N-Q sequence and the third column shows the inter-block probabilities. For all N-Q sequences, the intra-block probabilities are greater than the inter-block probabilities (hotter versus cooler colors). This implies that even as the coupling increases between residues across the N- and Q-blocks, residues still prefer to be in communities with other residues within the same block.

B. Effective energy function for coarse-grained simulations

The effective energy function for the coarse-grained model of N-Q constructs is given by Equation (1). In the current example, w_C denotes the colloidal potential between Q-blocks. We used data from 18 independent all atom simulations of pairs of Q₄₀ molecules to obtain the functional form for the w_C potential. These data yielded a pair correlation function g(r), at the target temperature, for the distance r between the centers-of-mass of the polyQ molecules; the associated potential of mean force has two minima that correspond to the two categories of interactions between homopolymeric globules, viz., docking and entanglement. The functional form of w_C is a sum of two terms, a Mie potential (w_E) to model entanglements and a Gaussian potential (w_D) to model the docking of globules.¹¹² Explicitly,

\begin{matrix} w_{C} = w_{E} + w_{D}, \\ w_{E} = \left(\frac{\gamma_{r}}{\gamma_{r} - \gamma_{a}}\right) \left(\frac{\gamma_{r}}{\gamma_{a}}\right)^{\frac{\gamma_{a}}{\gamma_{r} - \gamma_{a}}} \varepsilon_{E} \left[\left(\frac{\sigma_{E}}{r}\right)^{\gamma_{r}} - \left(\frac{\sigma_{E}}{r}\right)^{\gamma_{a}}\right], \\ w_{D} = -\varepsilon_{D} \exp\left[-\frac{(r - r_{d})^{2}}{2\delta_{d}^{2}}\right]. \end{matrix}

(13)

In Equation (13), ε_E is the well depth of the entanglement potential, σ_E is the distance r between the colloidal beads for which the entanglement potential is zero, γ_r and γ_a are the exponents of the repulsive and attractive terms of the Mie potential, ε_D is the well depth of the docking potential, r_d is the inter-particle separation at which the docking potential is minimized, and δ_d controls the width of the well.
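The functional form in Equation (13) can be written out as a short sketch. The default exponents (6 and 2) and the numerical parameters are those reported with Equation (14) in Subsection C. The function names are hypothetical, and the value of R_g used to convert σ_E and r_d to angstroms is a placeholder assumption, since the simulation-derived R_g is not quoted numerically in this section.

```python
import math

def w_entangle(r, eps_e, sigma_e, gamma_r=6.0, gamma_a=2.0):
    """Mie term w_E of Equation (13); its minimum value is -eps_e."""
    prefactor = (gamma_r / (gamma_r - gamma_a)) * (gamma_r / gamma_a) ** (
        gamma_a / (gamma_r - gamma_a))
    ratio = sigma_e / r
    return prefactor * eps_e * (ratio ** gamma_r - ratio ** gamma_a)

def w_dock(r, eps_d, r_d, delta_d):
    """Gaussian term w_D of Equation (13), centered at r_d."""
    return -eps_d * math.exp(-((r - r_d) ** 2) / (2.0 * delta_d ** 2))

def w_c(r, eps_e, sigma_e, eps_d, r_d, delta_d, gamma_r=6.0, gamma_a=2.0):
    """Colloidal potential w_C = w_E + w_D (energies in kcal/mol)."""
    return (w_entangle(r, eps_e, sigma_e, gamma_r, gamma_a)
            + w_dock(r, eps_d, r_d, delta_d))

RG_PLACEHOLDER = 10.0  # angstroms; assumed value, not taken from the source
params = dict(eps_e=3.72, sigma_e=0.32 * RG_PLACEHOLDER,
              eps_d=3.92, r_d=1.78 * RG_PLACEHOLDER, delta_d=5.94)

# Separation at which the Mie term alone reaches its minimum
r_min = params["sigma_e"] * (6.0 / 2.0) ** (1.0 / (6.0 - 2.0))
profile = [(r, w_c(r, **params)) for r in (3.0, 4.0, 10.0, 18.0, 30.0)]
```

The prefactor in w_E normalizes the Mie term so that its minimum equals −ε_E, which is why ε_E can be read directly as a well depth.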

C. Parameterization of uncoupled W_eff terms: W_b, W_θ, W_ϕ, and w_C

The target potential of mean force w_C,target(r) was obtained by Boltzmann inversion of g(r) such that w_C,target(r) = −k_B T ln[g(r)]. This potential of mean force, extracted from the all atom simulations, was used in a non-linear regression procedure to obtain the parameters for w_C in the coarse-grained model. The exponents γ_r and γ_a in Equation (13) were set equal to 6 and 2, respectively. The final parameters from the regression analysis are as follows:

\begin{matrix} \varepsilon_{E} = 3.72\ \text{kcal mol}^{-1}, \\ \sigma_{E} = 0.32\,R_{g}\ \text{Å}, \\ \varepsilon_{D} = 3.92\ \text{kcal mol}^{-1}, \\ r_{d} = 1.78\,R_{g}\ \text{Å}, \\ \delta_{d} = 5.94\ \text{Å}. \end{matrix}

(14)

In Equation (14), R_g refers to the ensemble-averaged radius of gyration for monomeric forms of Q₄₀ molecules that we calculate from all atom simulations. Figure 7 shows the target potential w_C,target(r), i.e., the potential of mean force extracted from all atom simulations of pairs of Q₄₀ molecules at 315 K, together with the fit to this potential obtained via the non-linear regression analysis that leads to the parameters summarized in Equation (14).
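The Boltzmann inversion step that produces the target potential can be sketched in a few lines. The sketch assumes that g(r) is available as a list of binned values and uses k_B in kcal mol⁻¹ K⁻¹. The function name and the sample g(r) values are hypothetical, and the subsequent non-linear regression against Equation (13) is not reproduced.

```python
import math

K_B = 0.0019872041  # Boltzmann constant in kcal mol^-1 K^-1

def boltzmann_invert(g_values, temperature=315.0):
    """Return w(r) = -k_B T ln g(r) for each bin of g(r).
    Bins that were never sampled (g == 0) map to +infinity."""
    kt = K_B * temperature
    return [-kt * math.log(g) if g > 0.0 else math.inf for g in g_values]

# Hypothetical binned g(r) values, for illustration only
g_of_r = [0.0, 0.5, 2.0, 1.0, 3.0, 1.0]
w_target = boltzmann_invert(g_of_r)
```

Bins with g(r) > 1 give negative (favorable) values of the potential of mean force, and bins with g(r) = 1 give zero.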

FIG. 7. — Interaction potential, *w_C*, for pairs of Q₄₀ molecules. The potential of mean force for pairs of Q₄₀ molecules was extracted from 18 independent all atom simulations at 315 K (orange curve) and was fit to Equation (13) using a non-linear regression procedure (black curve); *w_C* captures the two modes of interactions observed for pairs of Q₄₀ molecules, namely, entanglement and docking. Representative snapshots of entanglement and docking states are shown as insets. In these snapshots, Q₄₀ molecules are shown in atomic detail in orange and grey. Additionally, translucent spheres that correspond to the radius of gyration, *R_g*, of each molecule are drawn as a visual aid to distinguish between entanglement (large overlap between spheres) and docking (limited, if any, overlap between spheres) states.

Parameters for the bonded and dihedral angle terms were also determined using the Boltzmann inversion procedure. Specifically, for each N-Q construct, we first calculated the positions of the centers-of-mass for residues within the N-block and over all of the residues within the Q-block. The simulation results at 315 K were used to extract the probability distributions for each of the bond lengths, bond angles, and dihedral angles that define the coarse-grained model. The all atom probability distributions were then used to generate the bond length, bond angle, and dihedral angle parameters as described in Subsection II C.
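The mapping step described above, from all atom coordinates to bead positions, can be sketched as follows. The input layout (a list of residues, each a list of (mass, position) tuples) is a hypothetical choice, and treating the last N-block bead and the Q-bead as bonded neighbors is an assumption of this sketch. The histogramming and inversion of the resulting distributions follow Subsection II C and are not reproduced.

```python
import math

def center_of_mass(atoms):
    """Mass-weighted mean position of a list of (mass, (x, y, z))."""
    total = sum(mass for mass, _ in atoms)
    return tuple(sum(mass * xyz[k] for mass, xyz in atoms) / total
                 for k in range(3))

def map_to_beads(residues, n_block_length=17):
    """One bead per N-block residue plus a single bead for the Q-block."""
    n_beads = [center_of_mass(res) for res in residues[:n_block_length]]
    q_atoms = [atom for res in residues[n_block_length:] for atom in res]
    return n_beads + [center_of_mass(q_atoms)]

def bond_lengths(beads):
    """Distances between consecutive beads along the chain."""
    return [math.dist(a, b) for a, b in zip(beads, beads[1:])]

# Hypothetical toy molecule: two "N-block" residues, two "Q-block" residues
residues = [
    [(12.0, (0.0, 0.0, 0.0)), (12.0, (2.0, 0.0, 0.0))],
    [(12.0, (4.0, 0.0, 0.0)), (12.0, (6.0, 0.0, 0.0))],
    [(14.0, (8.0, 0.0, 0.0))],
    [(14.0, (10.0, 0.0, 0.0))],
]
beads = map_to_beads(residues, n_block_length=2)
lengths = bond_lengths(beads)
```

Collecting such bond lengths (and the analogous angles and dihedrals) over all conformations at 315 K yields the probability distributions that are then inverted.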