J Chem Inf Model. 2023 May 19;63(11):3288–3306. doi: 10.1021/acs.jcim.3c00460

Interpretable Machine Learning Models for Phase Prediction in Polymerization-Induced Self-Assembly

Yiwen Lu, Dilek Yalcin ‡,§, Paul J Pigram §, Lewis D Blackman ‡,*, Mario Boley †,*
PMCID: PMC10268968  PMID: 37208794

Abstract


While polymerization-induced self-assembly (PISA) has become a preferred synthetic route toward amphiphilic block copolymer self-assemblies, predicting their phase behavior from experimental design is extremely challenging, requiring time- and work-intensive creation of empirical phase diagrams whenever self-assemblies of novel monomer pairs are sought for specific applications. To alleviate this burden, we develop here the first framework for a data-driven methodology for the probabilistic modeling of PISA morphologies based on a selection and suitable adaption of statistical machine learning methods. As the complexity of PISA precludes generating large volumes of training data with in silico simulations, we focus on interpretable low-variance methods that can be interrogated for conformity with chemical intuition and that promise to work well with only 592 training data points that we curated from the PISA literature. We found that among the evaluated linear models, generalized additive models, and rule and tree ensembles, all but the linear models show a decent interpolation performance with around 0.2 estimated error rate and 1 bit expected cross-entropy loss (surprisal) when predicting the mixture of morphologies formed from monomer pairs already encountered in the training data. When considering extrapolation to new monomer combinations, the model performance is weaker but the best model (random forest) still achieves highly nontrivial prediction performance (0.27 error rate, 1.6 bit surprisal), which renders it a good candidate to support the creation of empirical phase diagrams for new monomers and conditions. Indeed, we find in three case studies that, when used to actively learn phase diagrams, the model is able to select a smart set of experiments that lead to satisfactory phase diagrams after observing only relatively few data points (5–16) for the targeted conditions. The data set as well as all model training and evaluation codes are publicly available through the GitHub repository of the last author.

Introduction

Amphiphilic block copolymer self-assemblies have a range of uses in drug delivery and nanomedicine, in diagnostics, as nanocomposites, in nanofabrication and electronics, and as stimuli-responsive flocculants.1−7 Over the past two decades, polymerization-induced self-assembly (PISA) has become a popular synthetic route toward such assemblies formed in solution, owing to numerous advantages over conventional self-assembly processes, such as the ability to perform self-assembly at high solids contents using a one-pot synthesis under mild aqueous (or pure solvent) conditions.8−14 In a typical PISA synthesis, a soluble polymer block is chain extended in a selective solvent using a monomer that, once polymerized, becomes insoluble in the solvent (Figure 1A). This insolubility drives the self-assembly of the resulting block copolymer to form nano-objects that are stabilized by the soluble block, thereby preventing macroscopic precipitation.

Figure 1.

Outline of the aqueous RAFT PISA process and resulting phase diagrams. (A) Illustration of morphology evolution during PISA, starting from the soluble corona polymer block, with core monomer dispersed or emulsified in the aqueous solvent. As the polymerization proceeds, an amphiphilic block copolymer is formed. At a critical core length, self-assembly into spherical micelles occurs. Typically, these spheres phase transition into worms, then into vesicles, as the core length grows further. Inset i shows an example RAFT PISA starting from a poly(cysteine methacrylate) (PCysMA) corona polymer and 2-hydroxypropyl methacrylate (HPMA) core monomer, to form a PCysMA-b-PHPMA diblock copolymer. R and Z groups are generic end groups resulting from the RAFT agent of choice. Inset ii shows the different internal structures of spheres, worms, and vesicles. (B) Experimental empirical phase diagram for PCysMA31-b-PHPMAy at 70 °C, where s, w, and v denote spheres, worms, and vesicles, respectively (adapted with permission from Ladmiral and Armes et al.15 Copyright 2015 Royal Society of Chemistry). (C) Example of an algorithm-derived probabilistic phase diagram for PCysMA31-b-PHPMAy at 70 °C using the proposed framework (with random forest model). Points surrounded by a diamond indicate training data actively selected by the model for this monomer combination.

One overarching complication with block copolymer self-assembly in solution is that, without careful introduction of the selective solvent, such structures do not usually exist in thermodynamic equilibrium,16 so predicting their phase behavior directly from experimental design remains very challenging. To date, this problem is typically addressed by the construction of empirically derived phase diagrams for individual PISA systems that can be used to predict the morphologies of future similar PISA syntheses (Figure 1B).17 However, this approach neither allows predictions for new monomer-pair classes and/or different conditions, nor does it provide uncertainty estimates that could be used for a systematic active exploration of the phase space. Most importantly, compared with potential in silico methods, this is a time- and labor-intensive process that limits the de novo design of PISA systems for novel applications. Full ab initio molecular dynamics are not available as a remedy since, with current computational means, such methods are inapplicable to PISA owing to its extreme complexity. The final morphological outcome is determined by the simultaneous polymerization of several hundred monomers to form each polymer chain and the self-assembly of the growing chains into particles, each containing up to thousands of chains, as well as the interactions of the core and corona components with each other and with the surrounding solvent. Hence, less expensive computational approaches are currently being explored that combine statistical methods with simpler ab initio computed features that capture only part of the PISA process. In particular, LogP/SA features of the resulting polymers have been used to predict the suitability of core monomers to undergo PISA in aqueous solutions.18,19 While this approach could potentially be extended to predict morphological outcomes, it still relies on relatively expensive computational features, which limits its applicability to new monomers and experimental conditions.

Herein, we develop the first computational framework for predicting morphological outcomes in PISA based solely on the experimental conditions and inexpensive features of the involved monomers that can be obtained through databases or trivial computations. In particular, we provide a statistical machine learning approach to probabilistically model the full resulting phase description of a PISA setup. The resulting probabilities can be used to draw computer-generated phase diagrams as well as to guide the experimental design to the acquisition of data points that are most informative for refining the phase boundaries initially provided by the model (see Figure 1C). Moreover, the developed models follow interpretable designs that are expected to work well with relatively small amounts of training data that can be obtained from the PISA literature. In summary, the methods presented here satisfy the following important requirements: (i) they predict PISA morphologies based on simple features of the experimental setup that are inexpensive to compute; (ii) they provide probabilistic uncertainty estimates about their predictions useful for active learning; (iii) they model the joint occurrence probability of the morphologies of interest to adequately account for the potential occurrence of coexistent morphologies; (iv) in addition to showing a good empirical performance, they possess an interpretable syntactic form that allows validation of their conformity with chemical and physical knowledge.

We address these requirements by forming additive logistic models for each target morphology and then combining them to provide calibrated probabilities for the joint morphology vector. The additive design is the key to interpretability, because the models can be analyzed by investigating their individual terms. In particular, we evaluate three additive designs of increasing complexity: linear models, generalized additive models with univariate transformations, and finally rule and tree ensembles where individual terms are given in the form of IF-THEN rules. In contrast to similar studies on the predictive modeling of polymer and solid properties,20,21 here we cannot rely on computational training data generated by ab initio methods owing to their above-mentioned inapplicability. Instead, model fitting is performed through a data set curated from the literature of PISA synthesis results, which is published along with the model training and evaluation code on GitHub. See Data Availability Statement. Using this data, we evaluate the models both in terms of their interpolation ability (predict phases for new conditions of known combinations of core and corona monomers) as well as in terms of their ability to extrapolate to new core/corona combinations. To provide further guidance about the generality of the models and their domain of applicability, we explicitly discuss the data curation rules as well as the descriptive statistics about the experimental conditions present in the training and test data.

In particular, the scope of the present work is reversible addition–fragmentation chain transfer polymerization-induced self-assembly (RAFT PISA) in aqueous solution. However, we envision that through engagement with the wider polymer community, the proposed framework can be extended to a generalized model for PISA using different solvent systems, initiation techniques, and monomer classes, as well as for related emerging techniques. This could include photoRAFT PISA,22,23 photoinduced electron/energy transfer (PET)-RAFT PISA,24 various mechanisms of atom transfer radical polymerization-induced self-assembly (ATRPISA),25−27 ring opening metathesis polymerization-induced self-assembly (ROMPISA),11 ring opening polymerization-induced self-assembly (ROPISA),28 ring-opening polymerization-induced crystallization-driven self-assembly (ROPI-CDSA),29 polymerization-induced electrostatic self-assembly (PIESA),30 polymerization-induced thermal self-assembly (PITSA),31 and continuous flow PISA setups.32,33

Results

Statistical Population and Data

While the methodology presented here is potentially applicable to a wide range of PISA setups, in this study, we focus on modeling setups using a single polymerization technique and solvent system: RAFT PISA in aqueous solution. Aqueous RAFT PISA is featured frequently in the literature, including in some of the seminal reports of PISA.34,35 From the perspective of potential model utility, it may seem desirable to include more or even all possible PISA systems. However, the actual model utility depends on its accuracy, which is determined by the amount of consistent high-quality sample data that can be obtained from the chosen population. The characterization of morphological outcomes in block copolymer self-assembly, including PISA, is nontrivial and occasionally ambiguous, particularly for mixed morphology systems. The focus on RAFT PISA allowed the curation of a reasonably representative data set of 592 examples of this population using published results15,17,36−72 from only a single research team—the Armes group from the University of Sheffield—and thus limited the possibility of characterization inconsistencies. See Methods for a full list of selection criteria.

Extending our work, for example to models that can predict the morphologies obtained across different initiation approaches, solvent systems, or for any monomer conversion, offers promising directions for further research. However, in the first instance, it appears sensible to treat these prediction problems independently of each other, as different initiation approaches can give rise to different morphological outcomes.23,73 This is true even for very similar initiation pathways (e.g., photoPISA vs thermal RAFT PISA, and photoRAFT PISA using different light wavelengths), owing to differences in end group fidelity, kinetics, and/or dispersity.23,73

Figure 2 provides an overview of the monomer structures sampled in this study. Note that, for the chemical representation, all acidic and basic moieties are represented in their ionized forms using either a sodium or chloride counterion where appropriate to balance the charge, unless already specified through the use of defined monomer salts. For corona polymers derived from ring opening polymerization (e.g., poly(ethylene glycol)), a diradical ring-opened form of the monomer was used owing to its greater resemblance to the polymeric structure (Figure 2).

Figure 2.

Structures of all corona-forming (top) and core-forming (bottom) monomers investigated in this study.

Conceptually, we treat the 592 examples as a random sample from the underlying population of RAFT PISA setups. Owing to its origin from a literature survey, the sampling distribution is not known explicitly and likely far from a space-filling uniform distribution. Figure 3 provides an overview of the monomer combinations, reaction conditions, and resulting morphologies that are present in the sample. The monomer combinations are clearly concentrated at GMA (corona) and HPMA (core) with only relatively few examples that contain neither monomer. Similarly, the morphological outcomes are heavily weighted to pure spheres. Both are important to keep in mind, since all of our model performance estimates are in some form related to this unknown distribution (even those explicitly targeting extrapolation performance to previously unseen monomer combinations).

Figure 3.

Overview of data set. Left: count of monomer combinations in core and corona blocks where components of copolymers are counted proportionally, resulting in 0 counts after rounding for some cross-linkers. Center: counts, empirical probabilities (relative counts), and empirical information entropy of different phases. The table contrasts actual empirical probabilities (p) and naïve probabilities (q) obtained by simply multiplying the individual morphology probabilities; the resulting difference between the empirical information entropy H(p) = −∑i=1…16 pi log pi and the cross-entropy H(p, q) = −∑i=1…16 pi log qi illustrates the potential loss of information when modeling morphologies individually instead of jointly (see Performance Estimation); the last row (gray) gives the column totals, where the totals in the first four columns are the marginal totals of the corresponding morphology. Right: histograms of reaction conditions using 12 equal-width bins (x-axis in log scale).

We describe each member of this population by a set of l + p random variables consisting of l target variables Y1, ..., Yl that describe the morphological outcome of the experiment and p covariates (or predictors) X1, ..., Xp that describe the experimental setup, consisting of the involved chemical formulations as well as the reaction conditions. Considering the target variables first, we recall that, in PISA, as the chain extension proceeds, the morphologies of the nano-objects typically evolve in a sequential manner from linear chains (where no assembly has yet occurred) into spherical micelles and worm-like micelles and then into vesicles. Intermediate structures can also form, for example jellyfish and lamellae, typically observed between worms and vesicles. In this study, we cluster the morphologies into four binary target variables, indicating the occurrence of spheres (Y1), worms (Y2), vesicles (Y3), and other structures (Y4), respectively (see Methods for the exact clustering rules). However, all presented methods are applicable to a finer or different taxonomy if sufficient training data can be obtained.

Turning to the covariates, we start by distinguishing three groups: (1) covariates that describe the corona and core monomers, (2) covariates that describe the reaction conditions, and (3) covariates that describe the resulting corona and core blocks. In particular, the first group is an important design choice that is crucial to obtaining a general model applicable to new monomer combinations. For that, it is insufficient to simply use binary covariates that indicate the presence or absence of certain monomers. Instead, we have to define a generic representation74 (sometimes also referred to as a “fingerprint” or “descriptor”) that makes different monomers comparable in a unified metric space. Furthermore, for the practical applicability of the model, it is crucial that all coordinates of this representation can be determined without expensive computations.75 Here, we choose a 5-dimensional representation each for the composition of corona monomers and the composition of core monomers, resulting in an overall 10-dimensional representation of all involved monomers. Specifically, we describe an individual monomer by its molar mass, molecular volume, sum of atomic polarizabilities, polar surface area, and finally its computed logarithmic octanol/water partition coefficient. These coordinates are then compositionally averaged over all monomers in the core and the corona, respectively, to form the final representation of the two block compositions.

In contrast to the first group, the second group of covariates is straightforward to define. It contains the quantities that are recorded in the data set: the degree of polymerization for corona and core polymer blocks, concentration of the total polymer mass in the solvent, the pH and salt concentration of the solvent, whether or not the polymer end-group is charged, and the reaction temperature. Finally, the third group consists of derived variables that describe the total molar mass of corona and core polymer blocks, their total molecular volume, and the relative proportions of these features between corona and core block. See Table 1 for a concise list of all covariates and their definitions. While the resulting set of 23 predictors contains some pairs of variables with partial correlation (see Supporting Information), all variables are simply derived and readily understandable, which is crucial for obtaining interpretable models.
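To make the construction of the covariate vector concrete, the following minimal Python sketch assembles the 23 predictors of Table 1 from monomer-level descriptors: the Monomer container, function names, and argument lists are illustrative assumptions for this sketch and do not correspond to the data pipeline in the published repository.

```python
from dataclasses import dataclass

@dataclass
class Monomer:
    mw: float      # molar mass (g/mol)
    mv: float      # McGowan characteristic molecular volume
    psa: float     # topological polar surface area
    apol: float    # sum of atomic polarizabilities
    clogp: float   # computed log octanol/water partition coefficient

def block_descriptor(monomers, fractions):
    """Compositionally average the five monomer descriptors of one block (X1-X5 or X6-X10)."""
    assert abs(sum(fractions) - 1.0) < 1e-9
    avg = lambda attr: sum(f * getattr(m, attr) for m, f in zip(monomers, fractions))
    return {k: avg(k) for k in ("mw", "mv", "psa", "apol", "clogp")}

def covariates(corona, corona_frac, core, core_frac, dp_cna, dp_cre,
               conc, ph, salt, charge, temp):
    """Assemble the 23 covariates of Table 1 for a single PISA setup."""
    cna, cre = block_descriptor(corona, corona_frac), block_descriptor(core, core_frac)
    x = {f"{k}_cna": v for k, v in cna.items()} | {f"{k}_cre": v for k, v in cre.items()}
    x |= dict(dp_cna=dp_cna, dp_cre=dp_cre, conc=conc, ph=ph,
              salt=salt, charge=charge, temp=temp)
    # derived block-level quantities (X18-X23)
    x["mw_tot_cna"], x["mw_tot_cre"] = cna["mw"] * dp_cna, cre["mw"] * dp_cre
    x["mv_tot_cna"], x["mv_tot_cre"] = cna["mv"] * dp_cna, cre["mv"] * dp_cre
    x["mass_ratio"] = x["mw_tot_cna"] / (x["mw_tot_cna"] + x["mw_tot_cre"])
    x["vol_ratio"] = x["mv_tot_cna"] / (x["mv_tot_cna"] + x["mv_tot_cre"])
    return x
```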

Table 1. List of Covariates (Predictive Variables)a.

variables | abbrev. | description | unit
X1, X6 | mw_cna, mw_cre | molar mass of corona and core monomers | g/mol
X2, X7 | mv_cna, mv_cre | McGowan characteristic molecular volume of corona and core monomers | m³/mol
X3, X8 | psa_cna, psa_cre | topological polar surface area of corona and core monomers | Å²/mol
X4, X9 | apol_cna, apol_cre | sum of atomic polarizabilities (including implicit H) of corona and core monomers | C² m/N
X5, X10 | clogp_cna, clogp_cre | computed log octanol/water partition coefficient | n/a
X11 | dp_cna | corona degree of polymerization | count
X12 | dp_cre | core degree of polymerization | count
X13 | conc | mass concentration of total core and corona solids in solvent | wt %
X14 | ph | pH of the solvent | n/a
X15 | salt | molar concentration of salt | M
X16 | charge | whether the polymer end group is charged (1) or not (0) | n/a
X17 | temp | reaction temperature | °C
X18 | mw_tot_cna | total molar mass of corona chains, i.e., X18 = X1·X11 | g/mol
X19 | mw_tot_cre | total molar mass of core chains, i.e., X19 = X6·X12 | g/mol
X20 | mv_tot_cna | total molar volume of corona polymer chains, i.e., X20 = X2·X11 | m³/mol
X21 | mv_tot_cre | total molar volume of core polymer chains, i.e., X21 = X7·X12 | m³/mol
X22 | mass_ratio | mass ratio of corona to corona plus core polymer: X22 = X18/(X18 + X19) | n/a
X23 | vol_ratio | volume ratio of corona to corona plus core polymer: X23 = X20/(X20 + X21) | n/a
a All quantities in the first group (monomer composition representation) are compositionally averaged; e.g., mw_cna is the mean molar mass of all corona monomers weighted by their mixture coefficients.

Model Design

While the main purpose of our models is to provide accurate predictions of the morphological outcomes of a given PISA setup, it is important to acknowledge that such predictions are necessarily uncertain76—owing to both the limited knowledge captured in the training data (epistemic uncertainty) as well as to the inherent unpredictability of the PISA process close to phase boundaries (aleatoric uncertainty). Therefore, we focus in this work on probabilistic models that describe the conditional distribution of the joint morphology occurrences Y = (Y1, ..., Yl) given the experimental setup described by the combined covariates X = (X1, ..., Xp). Formally, such models represent a conditional probability mass function p̂(y|x) that approximates the true conditional probability of observing phase y given covariates x, i.e.,

p̂(y | x) ≈ P(Y = y | X = x)  (1)

Such models naturally provide a phase prediction ŷ according to the maximum estimated probability, i.e.,

ŷ(x) = argmaxy p̂(y | x)  (2)

Additionally, they also provide a meaningful quantification of uncertainty (the estimated probability of the predicted phase) as well as likely alternative outcomes (other phases with an estimated probability close to that predicted phase). For instance, a probabilistic model might predict for a certain setup that the outcome will be a pure sphere phase with probability 0.5 but also express that a likely alternative is a mixed phase containing spheres and worms with probability 0.4.

To define a probabilistic model over l target variables, one could simply use l separate models, one for each morphology group, and then model the joint probability as the product of the individual morphology occurrence probabilities. However, this approach, which is sometimes called binary relevance77 in the machine learning literature, would assume that the different morphologies occur independently from each other. This assumption is violated by the physical assembly process, and its invalidity is also reflected in our sample data (Figure 3, center): for instance, we observe the formation of pure spheres with a relative count of 0.409, in contrast to 0.248, which would be expected under independence. To be able to capture dependencies between morphology occurrences, we instead employ the approach of probabilistic classifier chains.78 With this approach, the joint conditional probability is modeled using the chain rule of probability as the product

p̂(y | x) = ∏i=1…l p̂i(yi | x, yi–1)  (3)

where the notation Yi–1 = (Y1, ..., Yi–1) refers to the first i – 1 target variables. That is, the conditional probability of morphology i is modeled not only based on the covariates describing the experimental setup but also based on the occurrence of the “preceding” morphologies Y1, ..., Yi–1. Note that the term “preceding” refers to the chosen ordering of the morphologies and does not necessarily reflect the temporal evolution of the nano-objects. In our case, this means that worm is modeled taking into account the occurrence of sphere, vesicle is modeled taking into account the occurrence of sphere and worm, and other is modeled taking into account the occurrence of all three main morphology groups.
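A minimal sketch of such a probabilistic classifier chain is given below, written against scikit-learn; the class name, the logistic-regression base learner, and the brute-force prediction over all 2^l phase vectors are illustrative assumptions for this sketch rather than the implementation used for the reported results.

```python
import numpy as np
from itertools import product
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

class ProbabilisticClassifierChain:
    """Sketch of eq 3: p(y|x) = prod_i p_i(y_i | x, y_1, ..., y_{i-1})."""

    def __init__(self, base=None):
        self.base = base if base is not None else LogisticRegression(max_iter=1000)

    def fit(self, X, Y):
        # X: (n, p) covariates, Y: (n, l) binary morphology indicators
        self.models_ = []
        for i in range(Y.shape[1]):
            Xi = np.hstack([X, Y[:, :i]])          # augment with preceding morphologies
            self.models_.append(clone(self.base).fit(Xi, Y[:, i]))
        return self

    def joint_proba(self, x, y):
        """Probability of the full phase vector y at covariates x."""
        p, prefix = 1.0, []
        for i, m in enumerate(self.models_):
            xi = np.concatenate([x, prefix]).reshape(1, -1)
            p1 = m.predict_proba(xi)[0, 1]          # P(Y_i = 1 | x, y_1..y_{i-1})
            p *= p1 if y[i] == 1 else 1.0 - p1
            prefix.append(y[i])
        return p

    def predict(self, x):
        """Most probable phase vector (eq 2) by enumerating all 2^l candidates."""
        l = len(self.models_)
        return max(product([0, 1], repeat=l), key=lambda y: self.joint_proba(x, y))
```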

For the individual factors in (3), we then choose an additive design in order to facilitate interpretability of the fitted probability functions. That is, following the typical “logistic” design of additive models for binary response variables, we define

log[p̂i(1 | x, yi–1)/(1 − p̂i(1 | x, yi–1))] = f1(i)(x, yi–1) + ... + fk(i)(x, yi–1)  (4)

i.e., we model the log-odds of each morphology occurrence as a sum of k terms. With this construction, the desired probabilities can simply be recovered by the sigmoid transformation σ(a) = 1/(1 + exp(−a)) as

p̂i(1 | x, yi–1) = σ(f1(i)(x, yi–1) + ... + fk(i)(x, yi–1))  (5)

such that the overall model can be written as the product:

p̂(y | x) = ∏i=1…l σ((2yi − 1)(f1(i)(x, yi–1) + ... + fk(i)(x, yi–1)))  (6)

The additive design (4) facilitates model interpretation, because we can inspect each of the terms individually to understand the behavior of the whole model—given that the number of terms is not too large and that each term has a simple form that only refers to as few covariates as possible. As concrete choices, we consider three designs of varying complexity (see Figure 4 for an illustration):

  • 1.
    Linear models. With this most basic choice, the model for morphology i consists of p + i – 1 linear terms of the form fj(i)(x,yi–1) = βjzj, where zj denotes the j-th component of the extended covariate vector z = (x, yi–1), and one constant “intercept” term β0. More explicitly, the linear model for an individual morphology can be written as
    log[p̂i(1 | x, yi–1)/(1 − p̂i(1 | x, yi–1))] = β0 + β1z1 + ... + βp+i–1zp+i–1  (7)
    Because of the logistic transform this model is usually referred to as logistic regression in the literature and is considered a generalized linear model. This model is most interpretable, because the effect of every (extended) covariate on the morphology occurrence probability is described by a single real-valued coefficient. On the other hand, it is very restricted because it can only model linear univariate effects on the log odds and does not take into account interaction effects of experimental choices, which are likely important in PISA designs.
  • 2.

    Generalized additive models (GAMs). These models share the basic structure of linear models with one term per covariate but relax the linearity assumption of the individual terms. Instead, each term fj(i)(x,yi–1) = fj(zj) is a univariate transformation of a single extended covariate, typically fitted by a nonparametric interpolation technique such as smoothing splines or tree ensembles. With this model, the effect of a covariate can still be interpreted with relative ease by investigating the graph of the corresponding univariate component functions, which is referred to as partial dependency plot.

  • 3.

    Additive rule ensembles. These models consist of an arbitrary number of terms of the form fj(i)(x,yi–1) = βjqj(x,yi–1), where the functions qj describe a conjunction (logical AND) of binary conditions on the PISA design that can be written as qj(x,yi–1) = ∏k 1(zk ∈ [ajk, bjk]), with the product ranging over the covariates that appear in the rule and 1(·) denoting the indicator function that takes on value 1 if the condition in parentheses is satisfied and 0 otherwise. Hence, for this model type, an individual term can be considered an IF-THEN rule where the product qj forms the rule condition (IF-part), describing a rectangular region in the PISA design space by a logical conjunction (AND), and the parameter βj represents the rule consequent (THEN-part), i.e., the value predicted by the rule if the condition holds true (a minimal sketch of evaluating such rule terms follows this list).
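To make the rule terms concrete, the following minimal sketch evaluates a hypothetical rule of this form and combines a few such terms on the log-odds scale (eqs 4 and 5); the threshold values, the dictionary layout, and the intercept used in the example call are illustrative only.

```python
import math

# One rule term: a conjunction of interval conditions on named covariates
# plus a real-valued consequent (the coefficient beta_j).
rule = {"condition": {"mw_tot_cre": (16000.0, math.inf),   # illustrative thresholds only
                      "conc": (16.0, math.inf)},
        "beta": -2.9}

def rule_value(rule, z):
    """q_j(z) * beta_j: the consequent if all interval conditions hold, else 0."""
    ok = all(a < z[name] <= b for name, (a, b) in rule["condition"].items())
    return rule["beta"] if ok else 0.0

def log_odds(rules, intercept, z):
    """Additive log-odds of one morphology (eq 4) from a set of rule terms."""
    return intercept + sum(rule_value(r, z) for r in rules)

def probability(rules, intercept, z):
    """Sigmoid of the additive log-odds (eq 5)."""
    return 1.0 / (1.0 + math.exp(-log_odds(rules, intercept, z)))

# example evaluation on a hypothetical design point
print(probability([rule], intercept=0.5, z={"mw_tot_cre": 20000.0, "conc": 20.0}))
```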

Figure 4.

Additive model designs. Probability of an individual morphology is modeled via its log odds that, in turn, is given as sum of interpretable terms. Top: Linear model with one term per covariate of the form βjXj. Center: GAM model, which generalizes the linear terms to some univariate transformations fj(Xj). Bottom: Additive rule ensemble, where each term (rule) takes on a constant value (rule consequent) within a rectangular region defined by a subset of covariates (rule condition).

A special case of additive rule ensembles are the popular classification and regression trees as well as ensembles that are built from such trees such as gradient boosting regression trees79 and random forests80 (RF). Each tree leaf in such an ensemble can be considered an individual rule. These ensembles are widely used machine learning tools because of their effectiveness and robustness. However, they typically contain too many rules to practically analyze and interpret them. Therefore, while we do consider random forests as a benchmark, this work primarily focuses on RuleFit,81 a method for distilling small rule ensembles out of a larger tree ensemble.
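As an illustration of this tree-to-rule correspondence, the sketch below enumerates the leaf rules of a fitted scikit-learn decision tree on placeholder data; RuleFit additionally selects and reweights rules extracted from a whole ensemble with a sparse linear fit, which is omitted here, and the feature names and toy data are assumptions for the sketch.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def leaf_rules(tree, feature_names):
    """Enumerate each leaf of a fitted sklearn decision tree as an IF-THEN rule."""
    t, rules = tree.tree_, []

    def walk(node, conds):
        if t.children_left[node] == -1:                      # leaf: emit one rule
            counts = t.value[node][0]
            rules.append((conds, float(counts[1] / counts.sum())))
            return
        name, thr = feature_names[t.feature[node]], t.threshold[node]
        walk(t.children_left[node], conds + [f"{name} <= {thr:.4g}"])
        walk(t.children_right[node], conds + [f"{name} > {thr:.4g}"])

    walk(0, [])
    return rules  # [(list of conditions, fraction of positives in the leaf), ...]

# toy demonstration on random placeholder data (not the curated PISA data)
rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 3)), rng.integers(0, 2, size=200)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(leaf_rules(tree, ["conc", "dp_cre", "vol_ratio"]))
```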

Performance Estimation

To evaluate the performance of the proposed probabilistic models, we consider the following error or “loss” functions given an observed morphological outcome y and covariate vector x. The full phase error is defined as 0 if the predicted phase is equal to the observed phase and 1 otherwise, i.e.,

ℓpha(x, y) = 1(ŷ(x) ≠ y)  (8)

This very strict notion of error, which is also known as the 0/1-loss in the machine learning literature, does not quantify how close an incorrectly predicted phase is to the observed phase. For example, if y corresponds to “pure sphere”, the prediction ŷ corresponding to “sphere and worm” has a phase error of 1. As a more refined notion of error, we consider the individual morphology error, defined as the fraction of morphologies predicted incorrectly (out of all modeled morphologies), i.e.,

ℓmor(x, y) = (1/l) ∑i=1…l 1(ŷi(x) ≠ yi)  (9)

It is easy to see that for all x and y we always have ℓpha(x,y) ≥ ℓmor(x,y). For instance, in the above example, we obtain a morphology error of 1/4 because the occurrence of only one out of four morphology groups (worm) has been predicted incorrectly. This loss function, also known as the Hamming loss, might appear more reasonable than the phase error. However, it effectively decomposes the modeling problem into l separate problems, one for each morphology, and when optimizing this loss instead of the phase error, models might be worse at predicting joint events, e.g., whether the resulting phase will be pure or mixed. Ultimately, both error notions only depend on the prediction ŷ and do not measure the quality of the uncertainty estimates. This is why probabilistic models are typically evaluated and fitted with respect to the log loss, which is defined as the negative logarithm of the modeled probability of the observed outcome, i.e.,

ℓlog(x, y) = −log p̂(y | x)  (10)

where in this work we use the logarithm of base 2. Going back to our above example, assuming that the modeled probability of the predicted “sphere and worm” is 0.6 and the modeled probability of the observed “pure sphere” is p̂((1, 0, 0, 0) | x) = 0.25, the log loss is −log2 0.25 = 2. This concept is known as the negative log likelihood in statistics and as the cross entropy loss in deep learning; in information theory, it corresponds to the number of bits of information (or surprisal) that a user of the model receives when observing the outcome y.
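The three loss functions can be computed in a few lines; the sketch below reproduces the worked example from the text (observed pure sphere, predicted sphere and worm, modeled probability 0.25 for the observed phase), with function names chosen for illustration.

```python
import numpy as np

def phase_error(y_true, y_pred):
    """0/1 loss on the full phase vector (eq 8)."""
    return float(not np.array_equal(y_true, y_pred))

def morphology_error(y_true, y_pred):
    """Hamming loss: fraction of individual morphologies predicted incorrectly (eq 9)."""
    return float(np.mean(np.asarray(y_true) != np.asarray(y_pred)))

def log_loss_bits(p_observed):
    """Surprisal in bits of the observed phase under the model (eq 10)."""
    return float(-np.log2(p_observed))

# worked example from the text: observed "pure sphere", predicted "sphere and worm"
y, y_hat = [1, 0, 0, 0], [1, 1, 0, 0]
print(phase_error(y, y_hat))        # 1.0
print(morphology_error(y, y_hat))   # 0.25
print(log_loss_bits(0.25))          # 2.0
```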

The above loss functions quantify the error of a model for an individual observation. An overall assessment of the model quality is then given by the average loss over a random PISA setup X with corresponding outcome Y. This quantity is referred to as the prediction risk of the model and is formally defined as the expected value

R(p̂) = E[ℓ(X, Y)]  (11)

Here, the expectation is taken with respect to the same distribution that is represented by the training data, which is why we refer to this risk as the interpolation risk. In practice, one would in particular like to apply a model to monomer combinations that were not available when fitting the model. The expected performance in such a scenario is captured by the more challenging notion of extrapolation risk defined as

Rext(p̂) = E[ℓ(X, Y) | C ∉ {C1, ..., Cn}]  (12)

where C denotes the observed monomer composition at prediction time and C1, ..., Cn the monomer compositions observed in the training data. Depending on the employed loss function, these prediction risks are also referred to as phase error rate (ℓ = ℓpha), morphology error rate (ℓ = ℓmor), and cross entropy (ℓ = ℓlog).

These quantities are not directly observable and have to be estimated from data. To avoid positively biased estimates and at the same time use as much of our data collection as possible for model fitting, cross validation (CV) was performed to obtain unbiased estimates of both interpolation (30-fold CV) and extrapolation performance (grouped CV with respect to specific monomer pair hold-out test sets). See Methods for a detailed description. To put the prediction risks into context, two additional baselines were computed per loss function: an uninformed baseline, which simply assigns uniform probabilities to all phases, and an informed baseline, which assigns to each phase its marginal probability (i.e., independent of the covariates) as estimated from the data collection. Note that this informed baseline uses information that is actually not available at training time. It thus constitutes an optimistic estimate of the performance that can be achieved when not taking the covariates into account.
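A generic sketch of such a cross-validation risk estimate is given below; the factory function make_model, the grouping array pairs, and the concrete fold settings in the usage comments are assumptions for illustration, and the exact splitting scheme used for the reported estimates is described in the Methods section.

```python
import numpy as np
from sklearn.model_selection import KFold, GroupKFold

def cv_risk(make_model, X, Y, loss, splitter, groups=None):
    """Estimate a prediction risk by averaging a per-example loss over held-out folds."""
    losses = []
    for train, test in splitter.split(X, Y, groups):
        model = make_model().fit(X[train], Y[train])
        preds = model.predict(X[test])
        losses += [loss(Y[i], preds[k]) for k, i in enumerate(test)]
    return float(np.mean(losses))

# interpolation risk: 30-fold CV over the shuffled data set, e.g.
#   cv_risk(make_model, X, Y, phase_error, KFold(30, shuffle=True, random_state=0))
# extrapolation risk: hold out entire monomer combinations via a grouping array
# `pairs` that labels each row with its core/corona pair, e.g.
#   cv_risk(make_model, X, Y, phase_error, GroupKFold(n_splits=10), groups=pairs)
```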

Figure 5 shows the obtained results for all model variants (see Supporting Information for numerical results and further detailed performance statistics in the form of confusion matrices). As expected, the model performances tend to increase with decreasing model interpretability. The only exception to this is the estimated interpolation risk with respect to the log loss of RuleFit, which is less than that of the more complex random forest model. Moreover, as implied by the definition, the individual morphology error rate is consistently lower than the full phase error rate, and the risk values of the informed baseline are lower than those of the uninformed baseline, which is completely agnostic about the morphological outcomes. While the different risk types do not have a definite order based on their definition, we can observe the intuitive outcome that training risk is generally lower than the interpolation test risk, which in turn is lower than the extrapolation test risk; the only exception here being the linear model's expected Hamming and 0/1-loss.

Figure 5.

Overall model performance assessment as estimated by cross validation. The error rate for predicting the presence of an individual morphology (LHS) or the correct morphological mixture (center) is shown, along with the log loss (degree of surprisal, RHS). Horizontal lines indicate the uninformed baseline (assuming uniform phase probabilities) and the informed baseline (assuming marginal phase probabilities estimated from the full data set). Error bars indicate the standard errors across the 30-fold cross-validation.

Importantly, all but the linear models show a decent interpolation performance, i.e., when predicting the mixture of morphologies formed from monomer pairs that are for the most part already encountered in the training data. In terms of the log loss, the three nonlinear models achieve a risk between 1.02 bit (RuleFit) and 1.29 bit (GAM), which means that even the latter beats the informed baseline by more than 1 bit. Intuitively, even the GAM model provides on average information about one morphology that one would be completely uncertain about by just knowing the overall phase probabilities. The 0/1 (full phase error) prediction risk of the three models ranges between 0.21 (random forest) and 0.28 (GAM), or conversely, even the GAM predicts the exact phase vector correctly with an estimated probability of 0.72 compared to only 0.41 correct predictions for the informed baseline. Finally, in terms of the Hamming loss, i.e., the average performance for predicting the presence or absence of individual morphologies, the three complex models achieve a very convincing accuracy between 0.91 (GAM) and 0.93 (random forest and RuleFit) compared to only 0.5 for the uninformed baseline and 0.72 for the informed baseline.

Notably, even the performance of the interpretable rule ensemble and GAM models appears slightly better than what is achieved by noninterpretable support vector machines and tree ensembles for a similar non-PISA emulsion polymerization data set, where an interpolation accuracy of 0.8 was observed.82 However, since the training data set and problem are different between these studies, the absolute accuracies are not directly comparable.

When considering extrapolation to new monomer combinations, the model performances are weaker, but the best model (random forest) still achieves highly nontrivial prediction performance (1.6 bit average log loss, 0.27 0/1-error risk, and 0.12 Hamming risk). GAM and RuleFit also achieve decent 0/1 risks of around 0.37 and Hamming risks of around 0.15, which in both cases is substantially better than the informed baseline. Only in terms of the log-loss extrapolation risk do both GAM and RuleFit fail to beat the baseline, implying that, while their phase predictions are still valuable in this challenging setting, their probability estimates have to be treated with caution.

Model Analysis

After evaluating the predictive performance of the proposed models, we now analyze and interpret them semantically and compare their predictive logic to PISA domain knowledge. The models are covered in order of the complexity of their individual terms, i.e., first linear, then generalized additive, and finally rules and random forests. As mentioned above, the additive model design (4) allows interpretation in a modular83 fashion where each term is analyzed separately. By additionally assigning an importance value to each term, we focus the analysis on the terms that have on average the highest impact on the predictions.

Linear Model

The individual model terms are the simplest for the linear model, where, for each combination of input variable and morphology, we have exactly one term, which is given by a single coefficient. A positive coefficient indicates a positive (linear) effect of the variable on the log-odds of forming the morphology, and a negative coefficient indicates a negative effect. One might be tempted to additionally interpret the magnitude of the coefficient as the strength of the effect. However, this is not directly possible because of the different scales of the input variables. Hence, we use as importance score of term j the magnitude of the coefficient of the normalized version of the corresponding variable, |βj|s(Xj), where s denotes the sample standard deviation.
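A minimal sketch of this scoring is given below, assuming a scikit-learn logistic regression; the solver and regularization settings, and the function name, are illustrative and need not match those used for the reported model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def linear_term_importance(X, y, feature_names):
    """Rank linear terms by |beta_j| * s(X_j), i.e., the coefficient of the
    standardized version of each variable."""
    model = LogisticRegression(max_iter=5000).fit(X, y)
    scores = np.abs(model.coef_[0]) * X.std(axis=0, ddof=1)   # sample standard deviation
    order = np.argsort(scores)[::-1]
    return [(feature_names[j], float(model.coef_[0][j]), float(scores[j])) for j in order]
```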

Table 2 lists the five most important variables per morphology with respect to this approach. For spheres, worms, and vesicles, the top ranked feature was the total molar mass of the core. However, counterintuitively, mw_tot_cre was positively correlated with sphere formation and negatively correlated with the formation of the higher order morphologies, worms and vesicles. In all cases, this was offset by the intuitive correlation of the second (for worms and vesicles) or third most important feature (for spheres), mv_tot_cre. Higher mv_tot_cre values disfavor the formation of spheres and favor the formation of worms and vesicles, as expected by considering the increasing packing parameter of the assembly as the core volume increases. Taking these two inter-related features into consideration, it was noted that denser core monomers (with greater monomer mass/volume ratios) favored the formation of spheres over higher order structures. This interesting finding can be rationalized by considering that the top ranked core monomers by density were TFMA, cyclic, GlyMA, PhA, and EGDMA (see Supporting Information). Each of these is either a cross-linker present at low mol % in the core or a glassy, highly hydrophobic monomer that has poor solvation and mobility once it becomes oligomeric. Both of these monomer types hinder the possibility of phase transitions toward higher order morphologies.

Table 2. Linear Model Five Most Important Terms (Variables) per Morphology with Their Coefficient (β) and Importance Score Corresponding to the Coefficient Rescaled by the Variable Sample Standard Deviationa,81

sphere: mw_tot_cre (coef. 0.0005, imp. 18.1574); dp_cre (−0.0487, 11.6435); mv_tot_cre (−0.0213, 6.3723); apol_cna (0.2035, 2.8785); mv_tot_cna (0.0496, 1.5400)
worm: mw_tot_cre (−0.0004, 15.9898); mv_tot_cre (0.0242, 7.2232); dp_cre (0.0276, 6.5987); mw_tot_cna (−0.0003, 1.0360); apol_cna (−0.0597, 0.8446)
vesicle: mw_tot_cre (−0.0004, 13.9388); mv_tot_cre (0.0254, 7.5822); dp_cre (0.0258, 6.1585); mw_tot_cna (−0.0007, 2.8497); mv_tot_cna (0.0752, 2.3325)

a The morphology order corresponds to the order in which the individual linear models are combined in the classifier chain.

Another variable among the three most important variables for all morphologies was dp_cre. Higher dp_cre values disfavor the formation of spheres and favor the formation of worms and vesicles, which again is to be expected from domain knowledge, in agreement with packing parameter arguments. Moreover, the apol_cna term in the sphere model reflects that greater corona monomer polarizabilities favor sphere formation, which can be understood considering the top ranked corona monomers in order of polarizability: GluMA, KSPMA, CysMA, MPC, and DMAPS. All are highly charged monomers that show high electrostatic repulsion between corona chains and/or large degrees of corona solvation, thus disfavoring the formation of higher order morphologies and favoring the formation of spheres, all other factors being equal.

For worms and vesicles, mw_tot_cna appears as an important feature with a negative coefficient, as expected from the packing parameter considerations discussed previously. Interestingly, similar density arguments, this time relating to the corona monomer, were important for vesicle formation. The top ranked corona monomers by density were KSPMA, MAA, GluMA, CysMA, DSDMA, and MPC (see Supporting Information). Aside from the DSDMA cross-linker (which has previously been discussed as disfavoring higher order structures), each is a charged or ionizable monomer that disfavors vesicle formation owing to coronal charge–charge repulsion. Therefore, with corona monomer mass being equal, increasing corona monomer volume (thereby decreasing monomer density) favors vesicle formation. It is worth highlighting that the associations of these particular monomers in the data set may not be true across all chemical structures outside of the data set (e.g., not all charged monomers have higher density). The discovered trends are not necessarily causal. They primarily reflect associations that appear to hold in the observational distribution of the considered PISA literature.

Generalized Additive Model

For the GAM model, we still have exactly one term per variable/monomer combination. However, instead of being defined by a single linear coefficient, the effect of the variable on the morphology log odds is now described by some nonlinear transformation. While this dependency is considerably more complex, it can still be faithfully visualized in a partial dependency plot84 (see Section 10.13.2 in that reference). Conceptually, the partial dependency of the model prediction on variable j is defined by averaging the model output f(x) over all values of the other variables while keeping the value of variable j fixed to x, i.e., the univariate partial dependency function gj(x) and its data-based approximation are

gj(x) = E[f(X1, ..., Xj–1, x, Xj+1, ..., Xp)]  (13)
ĝj(x) = (1/n) ∑i=1…n f(x1(i), ..., xj–1(i), x, xj+1(i), ..., xp(i))  (14)

Specifically for GAMs, these partial dependencies are faithful in the sense that the dependency of f(x) on xj is the same for all x up to an additive constant. That means that, independently of the other variables' values, the partial dependency plot shows the effect of different xj-values on the prediction. As the importance score for GAM terms, one can use the statistical dispersion of the term values as measured by the mean absolute deviation from the mean. This results in a scale similar to that of the importance values for the linear models.
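The empirical partial dependency of eq 14 can be computed directly by clamping one covariate and averaging the model output over the data set, as in the following sketch; the model and predict_proba names in the usage comment are hypothetical placeholders.

```python
import numpy as np

def partial_dependency(predict, X, j, grid):
    """Empirical partial dependency of eq 14: average the model output over the data
    while covariate j is clamped to each grid value in turn."""
    curve = []
    for value in grid:
        X_clamped = X.copy()
        X_clamped[:, j] = value
        curve.append(float(np.mean(predict(X_clamped))))
    return np.array(curve)

# usage sketch: probability of one morphology as a function of covariate j
#   grid = np.linspace(X[:, j].min(), X[:, j].max(), 50)
#   curve = partial_dependency(lambda Z: model.predict_proba(Z)[:, 1], X, j, grid)
```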

The GAM models show similar general trends in predicting probabilities for spheres, worms, and vesicles to those found for the random forest model (discussed below), while they are somewhat less smooth and well-defined (Figure 6). Spheres are driven by lower concentrations and higher corona DPs as the top two most important variables. GAM appears to place greater importance on the individual monomer features such as clogP_cre and psa_cre. This dependency can be understood in physical terms by the fact that having more hydrophobic monomers in the core (with higher clogP and lower polar surface area) results in less solvated cores, with a greater energy barrier for polymer chain exchange and particle–particle fusion. This means more hydrophobic cores result in kinetically trapped spheres that are unable to undergo phase transitions. Similarly, having a charged end group results in a greater likelihood of forming spheres because of charge–charge repulsion, increasing the cone angle and limiting particle–particle fusion events. Spheres and vesicles show opposite trend behaviors to each other (for example clogp_cre) and worms often show a bell-shaped dependency (for example in vol_ratio and mass_ratio). As expected from packing parameter arguments, both vol_ratio and mass_ratio are inversely correlated to vesicle formation. It is also noteworthy that GAMs appear to place greater emphasis on inputs from previous elements of the classifier chain, identifying the inverse relationships between predicted presence of spheres and worms (third ranked worm-determining variable), predicted presence of spheres and vesicles (top ranked vesicle-determining variable), and predicted presence of worms and vesicles (fifth ranked vesicle-determining variable). In the latter case, the predicted absence of both spheres and worms greatly increases the likelihood of a vesicle through the process of elimination.

Figure 6.

GAM five most important variables per morphology with their partial dependency plots (importance score after colon, unit of variable in brackets where applicable). The y-axis represents the average model probability of the morphology occurring for the corresponding variable value on the x-axis. Actual probabilities can differ for specific values of the other variables only by an additive constant; i.e., the plot faithfully represents the effect of the variable on the modeled probability.

Rule Ensembles

In the interpretability spectrum, rule ensembles are not directly comparable to GAMs. On the one hand, their additive terms (rules) can be summarized by listing only a few parameters, namely the rule condition and the rule coefficient (consequent). On the other hand, the number of those terms is typically much larger than for GAMs, which have only at most one term per variable. Again, it helps to focus on the most important terms via computing an importance score. For that purpose, as for the linear model, we can again use the coefficient weighted by the standard deviation of the rule condition over random inputs. Since the rule conditions are binary functions, this simplifies to

imp(fj) = |βj| √(sj(1 − sj)),  (15)

where sj denotes the fraction of data points that satisfy the rule condition qj.
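A minimal sketch of this importance computation is given below; the example coefficient and support values are illustrative round numbers rather than entries of Table 3.

```python
import numpy as np

def rule_importance(beta, satisfied):
    """Importance of a rule term per eq 15: |beta_j| * sqrt(s_j (1 - s_j)),
    where s_j is the fraction of data points satisfying the rule condition."""
    s = float(np.mean(satisfied))
    return abs(beta) * np.sqrt(s * (1.0 - s))

# e.g., a rule with coefficient -2.93 satisfied by 207 of 592 data points
print(rule_importance(-2.93, np.r_[np.ones(207), np.zeros(385)]))  # ~1.40
```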

Table 3 shows the five most important rules for each morphology along with their coefficients and importance scores, as well as the number of entries (count), out of 592 in total, that satisfy the identified rule conditions. In general, features such as the total core and corona molar masses of the polymers, their mass and volume ratios, the cLogP values of the respective monomers, the concentration, and whether the polymer end group was charged were identified by the RuleFit model as the most important features. These are in good agreement with those obtained by the GAM and random forest models.

Table 3. RuleFit Five Most Important Rules per Morphologya.
sphere:
1. IF mw_tot_cre > 16161.4751 AND conc > 16.1 AND mw_tot_cna ≤ 12760.5298 THEN −2.9319 (imp. 1.4166, n = 207)
2. IF dp_cna ≤ 65.0 AND vol_ratio ≤ 0.3104 AND clogp_cre ≤ 1.37 THEN −1.7521 (imp. 0.8759, n = 298)
3. IF mass_ratio ≤ 0.4342 AND mw_cre ≤ 145.3096 AND charge = 0 THEN −1.7228 (imp. 0.8285, n = 209)
4. IF mw_tot_cre > 143.8346 AND charge = 0 AND vol_ratio ≤ 0.6135 THEN 1.3356 (imp. 0.6564, n = 243)
5. IF mw_tot_cna > 9204.0249 AND charge = 1 THEN −1.3740 (imp. 0.6356, n = 179)

worm:
1. IF 8503.8901 < mw_tot_cna ≤ 12760.5298 AND vol_ratio > 0.2135 THEN 1.8748 (imp. 0.85406, n = 155)
2. IF apol_cna ≤ 42.472 AND mass_ratio ≤ 0.248 AND sphere = 0 THEN −1.894 (imp. 0.8235, n = 152)
3. IF conc > 11.25 AND mass_ratio ≤ 0.4286 AND clogp_cre ≤ 1.4 THEN 1.3809 (imp. 0.6899, n = 299)
4. IF mv_tot_cre > 120.6472 AND psa_cre ≤ 46.7805 AND mv_tot_cre > 171.5035 THEN −1.3361 (imp. 0.6036, n = 424)
5. IF ph > 2.625 AND vol_ratio > 0.2153 THEN −1.2106 (imp. 0.5991, n = 265)

vesicle:
1. IF temp > 65.0 AND vol_ratio ≤ 0.2599 AND sphere = 0 THEN 1.8179 (imp. 0.8247, n = 162)
2. IF mw_tot_cna ≤ 56.8499 AND mass_ratio ≤ 0.37610 AND ph ≤ 4.25 THEN 2.3528 (imp. 0.7713, n = 63)
3. IF mw_tot_cna > 10236.2549 AND mass_ratio ≤ 0.2292 AND conc > 11.25 THEN 1.9049 (imp. 0.6061, n = 58)
4. IF conc ≤ 18.75 AND vol_ratio ≤ 0.3794 AND mw_tot_cna ≤ 5806.375 THEN 2.0187 (imp. 0.5888, n = 48)
5. IF mass_ratio ≤ 0.2292 AND mw_tot_cna > 4260.0101 AND clogp_cre ≤ 1.18 THEN 1.1943 (imp. 0.5851, n = 239)

a For each rule, the condition, coefficient (rule consequent), and importance are given; n denotes the number of data points that satisfy the condition (out of 592 in total).

For spheres, RuleFit's most important terms identify conditions under which the probability of sphere formation is decreased, indicated by their negative coefficient. For instance, the first rule states that if mw_tot_cre is above around 16 kg/mol, mw_tot_cna is less than or equal to around 12.8 kg/mol, and conc is greater than 16.1 wt %, then the log odds of spheres decreases by approximately 2.93. When breaking this rule into its individual components, we see that, for example, the mw_tot_cre range of >16 kg/mol describes a medium-high range of the overall data set's range (2,881.6–338 220 g/mol, where 75% of the formulations fall ≤44 020 g/mol) and mw_tot_cna ≤ 12,760.53 g/mol describes a low range of the overall data set (1981.35–45743.6 g/mol); see Supporting Information for a full table of variable sample statistics. The fact that these conditions, when combined, result in a decreased probability of sphere formation is in agreement with domain knowledge that short corona blocks relative to the core block disfavor the formation of spheres. Similarly, conc values above 16.1 wt % disfavor spheres, since higher concentrations promote the formation of higher order morphologies through sphere–sphere fusion events.

At the other end of the morphological spectrum, in its most important vesicle rule, RuleFit identified that high reaction temperatures, low corona/(corona + core) volumetric ratios, and the absence of spheres favor the formation of vesicles. Again, these trends are in agreement with domain understanding of PISA processes. Looking through the list of other rules for vesicle formation, lower reaction pH (often employed in order to partially neutralize ionizable acidic polymer end groups and coronas), lower corona/(corona + core) mass ratios, and higher concentrations also increase the probability of vesicle formation, as discussed previously. One exception was found where medium-low concentrations (≤18.75 wt %) unexpectedly favored the formation of vesicles; however, this rule only applied when the molar mass of the corona was very low (≤5.8 kg/mol) and the corona/(corona + core) volumetric ratio was less than or equal to 0.38. Both caveats to this rule favor vesicle formation from a packing parameter consideration, providing the possibility of forming vesicles at lower concentrations.

Overall, the rules identified by RuleFit agree well with domain knowledge as well as with the results obtained by GAM and random forest models. Importantly, in contrast to linear model and GAM, the inclusion of variable interactions allows rule ensembles to describe relative relations (such as high mw_tot_cre and low mw_tot_cna in the first sphere rule) and exceptions to global trends (like in the last discussed vesicle rule).

Random Forest

While random forests can technically be interpreted as very large rule ensembles (where each leaf of each tree corresponds to one rule), their scale renders a manual inspection of the corresponding rule set impractical, and in contrast to RuleFit ensembles, even the most important rules of a random forest tend to have only little effect on the overall prediction. To nevertheless provide an analysis of the model, we can again resort to partial dependency plots. However, note that, in the context of random forests, these plots are no longer faithful because of variable interaction effects. Specifically, the partial dependency plot only shows the average effect of a variable on the prediction. When fixing the other variables to concrete values, the shape can look different. To focus on the most important partial dependency plots, we can again use a variable importance score. In the context of tree models, and by extension for additive combinations of them, the most common choice is to consider the average decrease of impurity that is reached throughout the tree when splitting on the corresponding variable.
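A sketch of this analysis using scikit-learn's impurity-based feature importances and partial dependence utility is shown below; the hyperparameters and function name are illustrative, and the grid_values key assumes a recent scikit-learn version.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import partial_dependence

def rf_top_dependencies(X, y, feature_names, n_top=5):
    """Rank covariates by mean decrease of impurity and compute their
    partial dependency curves for one morphology (binary target y)."""
    rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
    top = np.argsort(rf.feature_importances_)[::-1][:n_top]
    curves = {}
    for j in top:
        pd = partial_dependence(rf, X, features=[j], kind="average")
        curves[feature_names[j]] = (pd["grid_values"][0], pd["average"][0])
    return curves
```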

Figure 7 gives the partial dependency plots for the five most important features per morphology. For predicting the presence of spheres, conc is the most important feature, with a decay in sphere probability occurring with increasing concentration. This relationship aligns with domain knowledge, since lower concentrations reduce the particle collisions required for phase transitions from spheres into higher order morphologies. Next in importance are mass_ratio and vol_ratio, which show an increasing probability of spheres with increasing corona to core plus corona ratio. This is again understood by the notion that increasing the corona mass and volume relative to the total block copolymer decreases the packing parameter, forcing the nano-objects to adopt spherical structures with greater cone angles. The steric repulsion of corona chains also prevents sphere–sphere fusion and subsequent phase transitions. Similarly, mv_tot_cre and dp_cre show an inverse correlation with sphere probability, since increasing the core volume and length increases the packing parameter and decreases the cone angle, increasing the drive toward higher order morphologies.

Figure 7.

Random forest five most important variables per morphology with their partial dependency plots (importance score after colon, unit of variable in brackets where applicable). The y-axis represents the average model probability of the morphology occurring for the corresponding variable value on the x-axis. The effect can differ for specific values of the other variables.

Looking at the other extreme morphology, vesicles, opposite trends are observed for vol_ratio, mass_ratio, conc, and, to a less-well-defined extent, mw_tot_cre. The formation of vesicles is driven by high concentrations, allowing for greater particle–particle collisions, a reduction in coronal steric repulsion, and an increase in packing parameter. It is noteworthy that the most important variable predicting the presence of a vesicle is the input from the preceding model in the classifier chain, namely whether a sphere has formed under the same conditions. Since sphere and vesicle are the two extreme morphologies in terms of packing parameter, size, and morphological evolution during the PISA process, it is logical that having a higher probability of sphere formation results in a lower probability of vesicle formation, and vice versa. It is also encouraging to see that worms, being an intermediate morphology, showed approximately bell-shaped dependency plots across all of the five most important variables. As the value of these variables increased, the probability of worms increased, alongside the probability of spheres decreasing; then at some critical point the probability of worms started to decrease, making way for the increasing probability of vesicles. It is interesting that predictors describing the monomers themselves did not feature in the top five variables for random forest; however, note that compound features such as mass_ratio already contain information relating to the core and corona monomer molar masses.

Actively Learned Phase Diagrams

In addition to insights into PISA and the provision of phase predictions with uncertainties, a central purpose of the developed models is the combination of many such predictions into automatically generated phase diagrams. In this context, an important facility of probabilistic models is to provide not only extrapolated phase diagrams based on currently available data but also guidance on what additional data would be most effective to increase their quality. In other words, the models support a dynamic iterative design of experiments, which is referred to as active learning in the machine learning literature. The central component for this is an acquisition function that, based on a model fitted to the present data, ranks the importance of potential future experiments. In our case, a natural choice for this function is the total model uncertainty of the full phase vector predicted at a candidate configuration, which can be measured as the conditional information entropy of the phase distribution given the experimental design x:

$$H(Y \mid X = x) = -\sum_{y \in \{0,1\}^{l}} P(Y = y \mid X = x)\,\log_{2} P(Y = y \mid X = x) \qquad (16)$$

This choice favors the acquisition of points for which the probability distribution over the 2^l different phase combinations (16 in our case) is closest to uniform. In particular, the entropy tends to be high at points close to the phase boundaries predicted by the current model fit, which are reasonable test points for refining those boundaries.
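In code, the acquisition value of a candidate design reduces to the entropy of its predicted joint phase distribution. The sketch below assumes a hypothetical helper predict_joint that returns, for each candidate row, the probabilities of all 2^l phase vectors produced by the fitted classifier chain; it is an illustration, not the published implementation.

```python
import numpy as np

def phase_entropy_bits(joint_probs):
    """Acquisition value of eq 16: Shannon entropy (in bits) of each row of a
    (n_candidates, 2**l) matrix of joint phase probabilities."""
    p = np.clip(np.asarray(joint_probs, dtype=float), 1e-12, 1.0)
    return -np.sum(p * np.log2(p), axis=1)

# Usage sketch (predict_joint and X_candidates are placeholders):
#   scores = phase_entropy_bits(predict_joint(X_candidates))
#   next_experiment = X_candidates[np.argmax(scores)]
```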

This active learning of phase diagrams was evaluated in three case studies corresponding to three different pairs of core and corona monomer: HPMA/CysMA, MEMA/GMA, and HPMA/GMA. For each combination, the condition with the most data points was identified as the target condition (where by condition we refer to the values for corona degree of polymerization, salt, pH, end group charge, and temperature, with concentration and core degree of polymerization being the two free variables of the targeted phase diagrams). The above three monomer pairings were the only cases in the data set where the target condition determined in this way had 20 or more data points with varying concentration and core degree of polymerization, as well as a nontrivial phase behavior with all three morphologies present. The evaluation focused on the random forest model for predicting phase diagrams, based on its superior performance, in particular in the extrapolation setting.

For each case, the data points corresponding to the target condition were left out when training the initial model. Then, iteratively, the model fit was updated by adding to the training data the target-condition point that maximizes the acquisition function (16). During this process, the error metrics defined above were tracked and the phase diagrams corresponding to the currently fitted probabilities were plotted. For HPMA/CysMA as well as for MEMA/GMA, all data points of the chosen monomer combination share the same condition. Hence, these cases correspond to realistic full extrapolation scenarios where none of the initial training points contain the targeted monomer combination.
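The resulting loop can be summarized as follows, with fit_model and predict_joint again standing in for hypothetical wrappers around the classifier-chain random forest and X_pool/y_pool holding the held-out target-condition points; this sketches the procedure rather than reproducing the authors' exact code.

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy (bits) of each row of a joint phase probability matrix."""
    p = np.clip(np.asarray(p, dtype=float), 1e-12, 1.0)
    return -np.sum(p * np.log2(p), axis=1)

def actively_learn(fit_model, predict_joint, X_train, y_train, X_pool, y_pool):
    """Repeatedly refit the model, score the remaining target-condition points
    by predictive entropy (16), and move the most uncertain one into training."""
    snapshots = []
    while len(X_pool) > 0:
        model = fit_model(X_train, y_train)
        i = int(np.argmax(entropy_bits(predict_joint(model, X_pool))))
        X_train = np.vstack([X_train, X_pool[i:i + 1]])
        y_train = np.vstack([y_train, y_pool[i:i + 1]])
        X_pool = np.delete(X_pool, i, axis=0)
        y_pool = np.delete(y_pool, i, axis=0)
        snapshots.append(model)  # model fitted before acquiring the i-th point
    return snapshots
```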

In the first case study (core: HPMA; corona: CysMA), presented in the top panel of Figure 8, all 22 target data points share the following conditions: a corona degree of polymerization of 31, salt 0.0, pH 7.0, a charged end group, and a temperature of 70 °C. For this case, the initial model predicts an all-sphere diagram (first column/top row), corresponding to a high 0/1-error of 0.77. This behavior was observed in most full extrapolation test cases and can be explained by the prevalence of (pure) sphere outcomes in the data set, reflecting the experimental designs surveyed from the literature. It can also be observed in the partial dependency plots in Figure 7 that, on average, the lowest probability of finding spheres is often still higher than the highest probability of forming a worm or vesicle. Notably, even with this predisposition, the total uncertainty acquisition function (visualized in the lower row of each panel) correctly identifies the upper-right region, where the all-sphere prediction is incorrect, as the region with the highest uncertainty. Consequently, the very first training point of the target condition already creates meaningful phase boundaries with all three individual morphologies present (the 0/1-error dropping to 0.59 and the Hamming loss to 0.3). Moreover, after three points the final phase boundaries are already qualitatively identified, with only a further five points required to drive the 0/1-error below 0.05. Hence, this case demonstrates how a pretrained model can strongly reduce the number of experiments required to obtain an adequate phase diagram for a previously unseen monomer pair.

Figure 8.

Snapshots of actively learned phase diagrams and acquisition values (i.e., degree of prediction uncertainty, in orange) for different numbers of condition-specific training points (m). The leftmost column shows the initial phase diagram without any data of the target condition; the rightmost shows the fit after the smallest number of training points that achieves a full phase error of 0.

In the second case study (core: MEMA; corona: GMA), presented in the middle panel of Figure 8, all 26 data points share the conditions of a corona degree of polymerization of 29, salt 0.0, pH 3.5, a noncharged end group, and a temperature of 70 °C. While here the initial model also predicts spheres to be present for all considered concentrations and core degrees of polymerization, it also contains a region with worms for intermediate dp_core values. In this case, it takes longer until the qualitatively final phase diagram is obtained, which happens after 13 observed target-condition points, at which point the 0/1-error is already very low (0.04). It also takes longer than in the first case to eliminate all test errors (up to 22 observed target points). This comparatively weaker performance can at least in part be attributed to the isolated point with a completely mixed phase (sphere, worm, and vesicle), which deviates from the otherwise simple vertical pattern with three discrete phase regions. Given this increased difficulty, it is still remarkable that the model requires only half of the available data to essentially identify the final phase diagram. Moreover, assuming that the morphological outcome of the isolated point is reliably reproducible, the model already identifies this critical point as worth investigating after only six rounds of experimentation.

In the third case study (core: HPMA; corona: GMA), presented in the lower panel of Figure 8, the following conditions constituted the largest group with 44 data points: corona degree of polymerization 78, salt 0, pH 7, noncharged end group, and temperature 70 °C. This case has a substantially more complex final phase diagram than the previous two. Instead of the simple vertically banded phase regions, there is an L-shaped region for spheres and rectangular regions for worms and vesicles, with lower bounds on the concentration. It is to be expected that more training examples of the target condition are necessary to approximately determine these complex phase boundaries. Indeed, while the qualitatively final picture starts to emerge after only 6 examples, the predicted phase boundaries are still rather rough and, correspondingly, the 0/1-error is high (half of the data points are not predicted entirely correctly). After m = 13 examples, the final pattern is established, but it takes three-quarters of the data to drive the error down to 0. What is again remarkable is the ability of the model and the acquisition function to target the experimentation toward points that lie close to the boundary of the finally established phase diagram.

In summary, even in complicated cases, qualitatively good phase diagrams are obtained by the random forest model using only a relatively small number of experimental results of the target condition. While determining the exact final phase boundaries can take more experiments, the machine learning approach not only provides good estimates for unseen parts of the phase space but is also useful in proposing experiments with a strong information gain. Both the initial predictions and the data efficiency are likely to improve further if the model is pretrained with a wider range of monomer combinations and conditions.

Conclusion

Herein, we presented the first unified data set curation and interpretable machine learning framework for data-driven prediction of morphological outcomes and recommendation of experimental designs in PISA. Employing simple features describing the monomers, polymers, and reaction conditions and combining them in an additive classifier chain design, we showed that the resulting models exhibit reasonable performance in predicting individual and joint morphological outcomes, including for new monomer pairs, as assessed against two statistical baselines. Exploiting the interpretability of their design, we analyzed all four employed modeling variants and showed that the algorithm-derived trends are in agreement with domain knowledge of polymer self-assembly. Finally, through active learning using the random forest model as an example, we demonstrated that such models can direct an experimentalist toward regions of the phase space that offer the greatest information gain in predicting the morphological outcomes of their neighboring data points. This can reduce the number of practical experiments required to obtain accurate empirical phase diagrams. As mentioned above, one limitation of the models fitted to our current data set is that they reflect the limited scope and trends of the surveyed literature, as expressed, e.g., in the prevalence of sphere formation. However, we envision that having the community build upon this data set to increase the diversity of chemical structures and morphologies will provide an accurate generalized model for prediction of PISA outcomes and de novo experimental design of bespoke self-assemblies. Once a sufficiently large and diverse set of training data is assembled, the adaptation of high-variance models such as deep neural networks for molecules85,86 or Gaussian processes87 will also become a viable research direction to achieve further improvements in terms of predictive and active learning performance.

Methods

Data Curation

Defined selection criteria were applied to identify a total of 52 aqueous RAFT PISA publications reported by the Armes group, up to and including 2019. A full list of these publications is cited in Statistical Population and Data and is also available with DOI links on GitHub. While some variations of RAFT PISA have been reported, for example photoinitiated RAFT, only thermally initiated RAFT polymerization methods were included in order to avoid variation between the different polymerization mechanisms. However, monomer systems that could undergo both dispersion and emulsion RAFT PISA were included, and no distinction was made between these systems. Cosolvent systems, for example alcohol/water mixtures, were excluded and only pure aqueous systems were curated; however, systems where different salt concentrations or pH values were used were allowed to feature in the data set. In reports where the pH was not specified, a neutral default value of pH 7 was used. Finally, only true diblock copolymer systems with a single corona and a single core block were included. This criterion excluded mixed-corona systems (i.e., chain extension from two different macro-chain transfer agents simultaneously) and triblock copolymer systems; however, formulations that used a statistical copolymer corona block and/or statistical copolymer core block were included in the data set. Only formulations that reached over 95% conversion were included, in order to limit the solubilizing effect of having residual unreacted core monomers present at the time of morphological assessment. Consequently, a total of 592 formulations were included in this study, and the final full list of these entries, including all covariates and experimental information, is also provided on GitHub.

The morphological outcomes from each of the formulations included in the data set were assigned to six morphological classes: “no assembly”, “spheres”, “worms”, “vesicles”, “precipitates”, and “other”. Within each morphological class, some further simplifications were applied in order to group similar morphologies together. For example, terms such as “bicontinuous worm”, “worm”, and “worm-like micelle” were assigned to the output category “worm”. Similarly, “multi-lamellar vesicles”, “oligo-lamellar vesicles” and “vesicles” were assigned to the “vesicle” output category. The “other” class contained all other miscellaneous morphologies such as “jellyfish”, “lamellae”, “dumbbells”, “lumpy rods”, and “monkey nut” morphologies.
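A minimal sketch of this grouping step is shown below; the dictionary covers only the example terms named above, and the complete curation mapping used for the data set would contain many more reported terms.

```python
# Illustrative label grouping for the curated morphology classes.
MORPHOLOGY_MAP = {
    "bicontinuous worm": "worm",
    "worm": "worm",
    "worm-like micelle": "worm",
    "multi-lamellar vesicles": "vesicle",
    "oligo-lamellar vesicles": "vesicle",
    "vesicles": "vesicle",
    "jellyfish": "other",
    "lamellae": "other",
    "dumbbells": "other",
    "lumpy rods": "other",
    "monkey nut": "other",
}

def curate_morphologies(reported_terms):
    """Map the free-text morphologies reported for one formulation onto the
    curated output classes (unknown terms are flagged for manual review)."""
    return {MORPHOLOGY_MAP.get(term.strip().lower(), "unreviewed")
            for term in reported_terms}

# Example: curate_morphologies(["worm-like micelle", "vesicles"]) -> {"worm", "vesicle"}
```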

For each formulation reported, the monomer composition of each block, the degree of polymerization of each block, the experimental conditions, and the final morphology class were recorded. From this, weighted averages of the molecular descriptors for each monomer were calculated for each block. Covariates describing the total block molar masses, molecular volumes, and mass and volume ratios of the corona and core blocks were calculated using the degree of polymerization and individual monomer properties, as outlined in Table 1.
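The compound covariates can be derived along the following lines; the column names (dp_corona, mw_corona_monomer, etc.) are illustrative placeholders rather than the exact headers of the published data set.

```python
import pandas as pd

def add_block_covariates(df: pd.DataFrame) -> pd.DataFrame:
    """Derive total block molar masses/volumes and corona-to-total ratios from
    per-block degrees of polymerization and (weighted-average) monomer properties."""
    out = df.copy()
    out["mw_tot_corona"] = out["dp_corona"] * out["mw_corona_monomer"]
    out["mw_tot_core"] = out["dp_core"] * out["mw_core_monomer"]
    out["mv_tot_corona"] = out["dp_corona"] * out["mv_corona_monomer"]
    out["mv_tot_core"] = out["dp_core"] * out["mv_core_monomer"]
    # ratios of corona to total (core plus corona), cf. the trends in Figure 7
    out["mass_ratio"] = out["mw_tot_corona"] / (out["mw_tot_corona"] + out["mw_tot_core"])
    out["vol_ratio"] = out["mv_tot_corona"] / (out["mv_tot_corona"] + out["mv_tot_core"])
    return out
```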

Molecular Descriptors

The PISA literature compiled in this study comprises 15 different corona-forming and 12 different core-forming monomers, whose molecular structures and identifiers are given in Figure 2. The molecular descriptors for these monomers were obtained using the free and open-source software PaDEL. A detailed explanation of the software and its capabilities can be found elsewhere.88 In this work, monomer structures were first drawn using ChemDraw and saved as MOL files before being introduced to the PaDEL software.
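For a scripted alternative, the same descriptor calculation can in principle be driven from Python via the padelpy wrapper around PaDEL-Descriptor, as sketched below; padelpy was not part of the original workflow, and the file name is a placeholder.

```python
from padelpy import from_mdl

def monomer_descriptors(mol_file: str) -> dict:
    """Compute PaDEL descriptors (no fingerprints) for a monomer stored as a
    MOL/MDL file and return them as a name -> value dictionary."""
    records = from_mdl(mol_file, descriptors=True, fingerprints=False)
    return records[0]

# Usage sketch: monomer_descriptors("hpma.mol").get("MW")
```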

Model Fitting

To fit the linear logistic regression model, we employed the Python implementation from the sklearn library in version 0.23.2. In particular, we used the LogisticRegressionCV with penalty l2 and lbfgs solver.89 This choice fits the coefficients β0, ..., βp by minimizing the regularized empirical risk with respect to the log loss, i.e.,

$$\hat{\beta} = \operatorname*{arg\,min}_{\beta \in \mathbb{R}^{p+1}} \; \frac{1}{n}\sum_{i=1}^{n} -\log \hat{p}(y_i \mid X = x_i, \beta) + \alpha \lVert \beta \rVert_{2}^{2} \qquad (17)$$

where $\hat{p}(1 \mid X = x, \beta) = \sigma(x^{\mathsf{T}}\beta)$ and the hyperparameter α is found by optimizing the mean test log loss over an internal stratified 5-fold cross validation on the training set (which is the default setting).
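A minimal sketch of this setup for one binary label of the classifier chain, assuming scikit-learn 0.23.2 (function arguments are placeholders):

```python
from sklearn.linear_model import LogisticRegressionCV

def fit_label_model(X, y):
    """l2-penalized logistic regression for one binary morphology label, with
    the regularization strength chosen by an internal stratified 5-fold cross
    validation that minimizes the test log loss."""
    clf = LogisticRegressionCV(penalty="l2", solver="lbfgs", cv=5,
                               scoring="neg_log_loss", max_iter=5000)
    return clf.fit(X, y)

# Usage sketch: fit_label_model(X_train, y_sphere_train).predict_proba(X_test)[:, 1]
```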

To fit the Generalized Additive Model, we employed the Python implementation from the interpret90 library in version 0.2.7, using the ExplainableBoostingClassifier with only univariate terms, i.e., setting the "interactions" parameter to 0. The implementation uses shallow tree-like ensembles for the univariate terms f1, ..., fp, fitted by gradient boosting, i.e., stagewise minimization of the log loss.
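A corresponding sketch, assuming interpret 0.2.7 (arguments again placeholders):

```python
from interpret.glassbox import ExplainableBoostingClassifier

def fit_gam(X, y, feature_names=None):
    """Explainable boosting machine with univariate (main-effect) terms only,
    i.e., no pairwise interactions, for one binary morphology label."""
    ebm = ExplainableBoostingClassifier(feature_names=feature_names, interactions=0)
    return ebm.fit(X, y)

# Usage sketch: fit_gam(X_train, y_sphere_train).explain_global()  # shape functions f_1, ..., f_p
```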

To fit the RuleFit model, we employed the Python implementation from rulefit in version 0.3.1. In particular, we used the RuleFit class with the model_type parameter set to "lr", with which both linear terms and rule terms are considered. Specifically, the fitting procedure finds the sparse coefficient vector β = (β1, ..., βk) that minimizes the l1-regularized empirical risk with respect to the log loss

$$\hat{\beta} = \operatorname*{arg\,min}_{\beta \in \mathbb{R}^{k}} \; \frac{1}{n}\sum_{i=1}^{n} -\log \hat{p}(y_i \mid X = x_i, \beta) + \frac{1}{C} \lVert \beta \rVert_{1} \qquad (18)$$

where $\hat{p}(1 \mid X = x_i, \beta) = \sigma\!\left(\sum_{j=1}^{k} \beta_j q_j(x_i)\right)$ for a large set of candidate rule conditions q1, ..., qk that have been extracted from a tree ensemble (using the default choice XGBoost). Through the l1-regularizer,91 typically only a small set of candidate rules obtains a nonzero coefficient. We employed an internal 10-fold cross validation to choose the regularization parameter C from the set {1, 2, 4, 8, 16, 32}. In order to achieve a stable selection, we chose the C value with the best median rank of empirical log loss across test folds.
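A condensed sketch, assuming the rulefit 0.3.1 package; X, y, and feature_names are placeholders, and the custom median-rank selection of C described above is simplified here to the package's built-in cross validation over the same candidate values.

```python
from rulefit import RuleFit

def fit_rulefit(X, y, feature_names):
    """RuleFit in classification mode with both linear and rule terms; the
    inverse regularization strength is selected from the candidate set by
    internal cross validation."""
    rf = RuleFit(rfmode="classify", model_type="lr",
                 Cs=[1, 2, 4, 8, 16, 32], cv=10, random_state=0)
    rf.fit(X, y, feature_names=feature_names)
    return rf

# Usage sketch: terms with nonzero coefficients after fitting
#   rules = fit_rulefit(X, y, feature_names).get_rules()
#   print(rules[rules.coef != 0].sort_values("importance", ascending=False).head())
```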

To fit the Random Forest model, we employed the Python implementation from the sklearn library in version 0.23.2. In particular, we used the RandomForestClassifier80 with a minimum of 1 sample per leaf and 200 estimators. Moreover, individual trees were fitted using a bootstrap subsample with replacement of the training set of size n (i.e., the number of training points), and for each split all variables were considered (rendering the random forest equivalent to bagged decision trees). Both of these are the default options. As splitting criterion, the Gini impurity index was used (as is the default setting).
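The corresponding sketch, assuming scikit-learn 0.23.2 (arguments are placeholders; max_features=None makes all variables available at every split, as described above):

```python
from sklearn.ensemble import RandomForestClassifier

def fit_forest(X, y):
    """Random forest for one binary morphology label: 200 bootstrapped trees,
    a minimum of 1 sample per leaf, all variables considered at each split,
    and the Gini impurity as splitting criterion."""
    forest = RandomForestClassifier(n_estimators=200, min_samples_leaf=1,
                                    max_features=None, bootstrap=True,
                                    criterion="gini", random_state=0)
    return forest.fit(X, y)

# Usage sketch: fit_forest(X_train, y_sphere_train).predict_proba(X_grid)[:, 1]
```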

Model Performance Estimation

Model interpolation performance was estimated by performing 30-fold cross validation on the overall data collection with uniform random splits. For extrapolation performance, a grouped cross validation was performed in which each monomer combination with 22 or fewer training examples in the data became a test group. With this choice we ended up with 29 folds and a mean train (test) size of about 585.65 (6.35). This choice was made so that the resulting smallest train size (570) is never much less than the uniform train sizes of the interpolation cross validation (572); hence, any performance degradation can be attributed to the harder evaluation mode of testing on an unseen monomer combination rather than to smaller train sizes.
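The two evaluation modes can be generated with splits along the following lines, assuming scikit-learn; X and monomer_pair_ids (one group label per formulation) are placeholders.

```python
import numpy as np
from sklearn.model_selection import KFold

def interpolation_splits(X, n_splits=30, seed=0):
    """Uniformly random K-fold splits over all formulations."""
    return list(KFold(n_splits=n_splits, shuffle=True, random_state=seed).split(X))

def extrapolation_splits(groups, max_group_size=22):
    """Grouped splits: each monomer combination with at most `max_group_size`
    examples becomes one held-out test group; larger combinations always stay
    in the training set."""
    groups = np.asarray(groups)
    idx = np.arange(len(groups))
    for g in np.unique(groups):
        test = idx[groups == g]
        if len(test) <= max_group_size:
            yield idx[groups != g], test

# Usage sketch: list(extrapolation_splits(monomer_pair_ids)) yields the 29 folds
```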

Acknowledgments

M.B. was supported by the Australian Research Council (DP210100045). L.D.B. was supported by a CSIRO CERC Postdoctoral Fellowship for the initial stages of this study. D.Y. was supported by a CSIRO research agreement with La Trobe University. This work was supported in part by the Australian National Fabrication Facility (ANFF), a company established under the National Collaborative Research Infrastructure Strategy, through the La Trobe University Centre for Materials and Surface Science.

Data Availability Statement

The software and data sets generated and/or analyzed for this study are available on GitHub at https://github.com/marioboley/PISA_ML.

Supporting Information Available

The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acs.jcim.3c00460.

  • Descriptive statistics of covariates, monomer molar densities, covariate correlations, numerical model performances, and prediction confusion matrices (PDF)

  • SMILES data (XLSX)

Author Contributions

Y.L. and D.Y. contributed equally to this work.

Author Contributions

L.D.B. and M.B. conceived the study and wrote the main sections of the manuscript; L.D.B. curated the data set; M.B. designed the modeling and active learning approach; Y.L., D.Y., and M.B. implemented the machine learning models; Y.L., D.Y., L.D.B., and M.B. evaluated the model performance; L.D.B. conducted the model interpretation; M.B. and Y.L. implemented the active learning algorithm; D.Y. and Y.L. wrote the Methods section. All authors reviewed the manuscript.

The authors declare no competing financial interest.

Supplementary Material

ci3c00460_si_001.pdf (158.8KB, pdf)
ci3c00460_si_002.xlsx (14.7KB, xlsx)

References

  1. Karayianni M.; Pispas S. Block copolymer solution self-assembly: Recent advances, emerging trends, and applications. J. Polym. Sci. 2021, 59 (17), 1874–1898. 10.1002/pol.20210430. [DOI] [Google Scholar]
  2. Cabral H.; Miyata K.; Osada K.; Kataoka K. Block copolymer micelles in nanomedicine applications. Chem. Rev. 2018, 118 (14), 6844–6892. 10.1021/acs.chemrev.8b00199. [DOI] [PubMed] [Google Scholar]
  3. Jin Q.; Deng Y.; Chen X.; Ji J. Rational design of cancer nanomedicine for simultaneous stealth surface and enhanced cellular uptake. ACS Nano 2019, 13 (2), 954–977. 10.1021/acsnano.8b07746. [DOI] [PubMed] [Google Scholar]
  4. Bockstaller M. R.; Mickiewicz R. A.; Thomas E. L. Block copolymer nanocomposites: perspectives for tailored functional materials. Adv. Mater. 2005, 17 (11), 1331–1349. 10.1002/adma.200500167. [DOI] [PubMed] [Google Scholar]
  5. Kim H.-C.; Park S.-M.; Hinsberg W. D. Block copolymer based nanostructures: materials, processes, and applications to electronics. Chem. Rev. 2010, 110 (1), 146–177. 10.1021/cr900159v. [DOI] [PubMed] [Google Scholar]
  6. Hu X.-H.; Xiong S. Fabrication of nanodevices through block copolymer self-assembly. Front. Nanotechnol. 2022, 4, 762996. 10.3389/fnano.2022.762996. [DOI] [Google Scholar]
  7. Hyrycz M.; Ochowiak M.; Krupińska A.; Włodarczak S.; Matuszak M. A review of flocculants as an efficient method for increasing the efficiency of municipal sludge dewatering: Mechanisms, performances, influencing factors and perspectives. Sci. Total Environ. 2022, 820, 153328. 10.1016/j.scitotenv.2022.153328. [DOI] [PubMed] [Google Scholar]
  8. Armes S. P.; Perrier S.; Zetterlund P. B. Introduction to polymerisation-induced self assembly. Polym. Chem. 2021, 12, 8–11. 10.1039/D0PY90190C. [DOI] [Google Scholar]
  9. Penfold N. J. W.; Yeow J.; Boyer C; Armes S. P Emerging trends in polymerization-induced self-assembly. ACS Macro Lett. 2019, 8 (8), 1029–1054. 10.1021/acsmacrolett.9b00464. [DOI] [PubMed] [Google Scholar]
  10. Canning S. L.; Smith G. N.; Armes S. P. A critical appraisal of raft-mediated polymerization-induced self-assembly. Macromolecules 2016, 49 (6), 1985–2001. 10.1021/acs.macromol.5b02602. [DOI] [PMC free article] [PubMed] [Google Scholar]
  11. Varlas S.; Foster J. C.; O’Reilly R. K. Ring-opening metathesis polymerization-induced self-assembly (ROMPISA). Chem. Commun. 2019, 55, 9066–9071. 10.1039/C9CC04445K. [DOI] [PubMed] [Google Scholar]
  12. Phan H.; Taresco V.; Penelle J.; Couturaud B. Polymerisation-induced self-assembly (PISA) as a straightforward formulation strategy for stimuli-responsive drug delivery systems and biomaterials: recent advances. Biomater. Sci. 2021, 9, 38–50. 10.1039/D0BM01406K. [DOI] [PubMed] [Google Scholar]
  13. Lansalot M.; Rieger J. Polymerization-induced self-assembly. Macromol. Rapid Commun. 2019, 40 (2), 1800885. 10.1002/marc.201800885. [DOI] [PubMed] [Google Scholar]
  14. Cao J.; Tan Y.; Chen Y.; Zhang L.; Tan J. Expanding the scope of polymerization-induced self-assembly: Recent advances and new horizons. Macromol. Rapid Commun. 2021, 42 (23), 2100498. 10.1002/marc.202100498. [DOI] [PubMed] [Google Scholar]
  15. Ladmiral V.; Charlot A.; Semsarilar M.; Armes S. P. Synthesis and characterization of poly (amino acid methacrylate)-stabilized diblock copolymer nano-objects. Polym. Chem. 2015, 6 (10), 1805–1816. 10.1039/C4PY01556H. [DOI] [Google Scholar]
  16. Cameron N. S.; Corbierre M. K.; Eisenberg A. 1998 E.W.R. Steacie award lecture asymmetric amphiphilic block copolymers in solution: a morphological wonderland. Can. J. Chem. 1999, 77 (8), 1311–1326. 10.1139/v99-141. [DOI] [Google Scholar]
  17. Blanazs A.; Ryan A. J.; Armes S. P. Predictive phase diagrams for raft aqueous dispersion polymerization: effect of block copolymer composition, molecular weight, and copolymer concentration. Macromolecules 2012, 45 (12), 5099–5107. 10.1021/ma301059r. [DOI] [Google Scholar]
  18. Foster J. C.; Varlas S.; Couturaud B.; Jones J. R.; Keogh R.; Mathers R. T.; O’Reilly R. K. Predicting monomers for use in polymerization-induced self-assembly. Angew. Chem. 2018, 130 (48), 15959–15963. 10.1002/ange.201809614. [DOI] [PubMed] [Google Scholar]
  19. Varlas S.; Foster J. C.; Arkinstall L. A.; Jones J. R.; Keogh R.; Mathers R. T.; O’Reilly R. K. Predicting monomers for use in aqueous ring-opening metathesis polymerization-induced self-assembly. ACS Macro Lett. 2019, 8 (4), 466–472. 10.1021/acsmacrolett.9b00117. [DOI] [PMC free article] [PubMed] [Google Scholar]
  20. Mannodi-Kanakkithodi A.; Pilania G.; Huan T. D.; Lookman T.; Ramprasad R. Machine learning strategy for accelerated design of polymer dielectrics. Sci. Rep. 2016, 6 (1), 20952. 10.1038/srep20952. [DOI] [PMC free article] [PubMed] [Google Scholar]
  21. Ramprasad R.; Batra R.; Pilania G.; Mannodi-Kanakkithodi A.; Kim C. Machine learning in materials informatics: Recent applications and prospects. npj Comput. Mater. 2017, 3 (1), 54. 10.1038/s41524-017-0056-5. [DOI] [Google Scholar]
  22. Tan J.; Sun H.; Yu M.; Sumerlin B. S.; Zhang L. Photo-PISA: Shedding light on polymerization-induced self-assembly. ACS Macro Lett. 2015, 4 (11), 1249–1253. 10.1021/acsmacrolett.5b00748. [DOI] [PubMed] [Google Scholar]
  23. Yeow J.; Sugita O. R.; Boyer C. Visible light-mediated polymerization-induced self-assembly in the absence of external catalyst or initiator. ACS Macro Lett. 2016, 5 (5), 558–564. 10.1021/acsmacrolett.6b00235. [DOI] [PubMed] [Google Scholar]
  24. Ng G.; Yeow J.; Xu J.; Boyer C. Application of oxygen tolerant PET-RAFT to polymerization-induced self-assembly. Polym. Chem. 2017, 8, 2841–2851. 10.1039/C7PY00442G. [DOI] [Google Scholar]
  25. Wang G.; Schmitt M.; Wang Z.; Lee B.; Pan X.; Fu L.; Yan J.; Li S.; Xie G.; Bockstaller M. R.; Matyjaszewski K. Polymerization-induced self-assembly (PISA) using ICAR ATRP at low catalyst concentration. Macromolecules 2016, 49 (22), 8605–8615. 10.1021/acs.macromol.6b01966. [DOI] [Google Scholar]
  26. Shahrokhinia A.; Scanga R. A.; Biswas P.; Reuther J. F. PhotoATRP-induced self-assembly (photoATR-PISA) enables simplified synthesis of responsive polymer nanoparticles in one-pot. Macromolecules 2021, 54 (3), 1441–1451. 10.1021/acs.macromol.0c02106. [DOI] [Google Scholar]
  27. Wang K.; Wang Y.; Zhang W. Synthesis of diblock copolymer nano-assemblies by PISA under dispersion polymerization: comparison between atrp and raft. Polym. Chem. 2017, 8, 6407–6415. 10.1039/C7PY01618B. [DOI] [Google Scholar]
  28. Jiang J.; Zhang X.; Fan Z.; Du J. Ring-opening polymerization of N-carboxyanhydride-induced self-assembly for fabricating biodegradable polymer vesicles. ACS Macro Lett. 2019, 8 (10), 1216–1221. 10.1021/acsmacrolett.9b00606. [DOI] [PubMed] [Google Scholar]
  29. Hurst P. J.; Rakowski A. M.; Patterson J. P. Ring-opening polymerization-induced crystallization-driven self-assembly of poly-L-lactide-block-polyethylene glycol block copolymers (ROPI-CDSA). Nat. Commun. 2020, 11, 4690. 10.1038/s41467-020-18460-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  30. Ding Y.; Cai M.; Cui Z.; Huang L.; Wang L.; Lu X.; Cai Y. Synthesis of low-dimensional polyion complex nanomaterials via polymerization-induced electrostatic self-assembly. Angew. Chem., Int. Ed. 2018, 57 (4), 1053–1056. 10.1002/anie.201710811. [DOI] [PubMed] [Google Scholar]
  31. Figg C. A.; Simula A.; Gebre K. A.; Tucker B. S.; Haddleton D. M.; Sumerlin B. S. Polymerization-induced thermal self-assembly (PITSA). Chem. Sci. 2015, 6, 1230–1236. 10.1039/C4SC03334E. [DOI] [PMC free article] [PubMed] [Google Scholar]
  32. Zaquen N.; Yeow J.; Junkers T.; Boyer C.; Zetterlund P. B. Visible light-mediated polymerization-induced self-assembly using continuous flow reactors. Macromolecules 2018, 51 (14), 5165–5172. 10.1021/acs.macromol.8b00887. [DOI] [Google Scholar]
  33. Liu D.; Cai W.; Zhang L.; Boyer C.; Tan J. Efficient photoinitiated polymerization-induced self-assembly with oxygen tolerance through dual-wavelength type I photoinitiation and photoinduced deoxygenation. Macromolecules 2020, 53 (4), 1212–1223. 10.1021/acs.macromol.9b02710. [DOI] [Google Scholar]
  34. Rieger J.; Stoffelbach F.; Bui C.; Alaimo D.; Jérôme C.; Charleux B. Amphiphilic poly(ethylene oxide) macromolecular RAFT agent as a stabilizer and control agent in ab initio batch emulsion polymerization. Macromolecules 2008, 41 (12), 4065–4068. 10.1021/ma800544v. [DOI] [Google Scholar]
  35. Ali A. M. I.; Pareek P.; Sewell L.; Schmid A.; Fujii S.; Armes S. P.; Shirley I. M. Synthesis of poly(2-hydroxypropyl methacrylate) latex particles via aqueous dispersion polymerization. Soft Matter 2007, 3, 1003–1013. 10.1039/b704425a. [DOI] [PubMed] [Google Scholar]
  36. Li Y.; Armes S. P. RAFT synthesis of sterically stabilized methacrylic nanolatexes and vesicles by aqueous dispersion polymerization. Angew. Chem. 2010, 122 (24), 4136–4140. 10.1002/ange.201001461. [DOI] [PubMed] [Google Scholar]
  37. Sugihara S.; Blanazs A.; Armes S. P.; Ryan A. J.; Lewis A. L. Aqueous dispersion polymerization: a new paradigm for in situ block copolymer self-assembly in concentrated solution. J. Am. Chem. Soc. 2011, 133 (39), 15707–15713. 10.1021/ja205887v. [DOI] [PubMed] [Google Scholar]
  38. Sugihara S.; Armes S. P.; Blanazs A.; Lewis A. L. Non-spherical morphologies from cross-linked biomimetic diblock copolymers using raft aqueous dispersion polymerization. Soft Matter 2011, 7 (22), 10787–10793. 10.1039/c1sm06593a. [DOI] [Google Scholar]
  39. Verber R.; Blanazs A.; Armes S. P. Rheological studies of thermo-responsive diblock copolymer worm gels. Soft Matter 2012, 8 (38), 9915–9922. 10.1039/c2sm26156a. [DOI] [Google Scholar]
  40. Semsarilar M.; Ladmiral V.; Blanazs A.; Armes S. P. Anionic polyelectrolyte-stabilized nanoparticles via RAFT aqueous dispersion polymerization. Langmuir 2012, 28 (1), 914–922. 10.1021/la203991y. [DOI] [PubMed] [Google Scholar]
  41. Chambon P.; Blanazs A.; Battaglia G.; Armes S. P. How does cross-linking affect the stability of block copolymer vesicles in the presence of surfactant?. Langmuir 2012, 28 (2), 1196–1205. 10.1021/la204539c. [DOI] [PubMed] [Google Scholar]
  42. Rosselgong J.; Blanazs A.; Chambon P.; Williams M.; Semsarilar M.; Madsen J.; Battaglia G.; Armes S. P. Thiol-functionalized block copolymer vesicles. ACS Macro Lett. 2012, 1 (24), 1041–1045. 10.1021/mz300318a. [DOI] [PubMed] [Google Scholar]
  43. Semsarilar M.; Ladmiral V.; Blanazs A.; Armes S. P. Cationic polyelectrolyte-stabilized nanoparticles via RAFT aqueous dispersion polymerization. Langmuir 2013, 29 (24), 7416–7424. 10.1021/la304279y. [DOI] [PubMed] [Google Scholar]
  44. Ratcliffe L. P. D.; Blanazs A.; Williams C. N.; Brown S. L.; Armes S. P. RAFT polymerization of hydroxy-functional methacrylic monomers under heterogeneous conditions: effect of varying the core-forming block. Polym. Chem. 2014, 5 (11), 3643–3655. 10.1039/C4PY00203B. [DOI] [Google Scholar]
  45. Cunningham V. J.; Alswieleh A. M.; Thompson K. L.; Williams M.; Leggett G. J.; Armes S. P.; Musa O. M. Poly (glycerol monomethacrylate)–poly (benzyl methacrylate) diblock copolymer nanoparticles via RAFT emulsion polymerization: synthesis, characterization, and interfacial activity. Macromolecules 2014, 47 (16), 5613–5623. 10.1021/ma501140h. [DOI] [Google Scholar]
  46. Warren N. J.; Mykhaylyk O. O.; Mahmood D.; Ryan A. J.; Armes S. P Raft aqueous dispersion polymerization yields poly (ethylene glycol)-based diblock copolymer nano-objects with predictable single phase morphologies. J. Am. Chem. Soc. 2014, 136 (3), 1023–1033. 10.1021/ja410593n. [DOI] [PMC free article] [PubMed] [Google Scholar]
  47. Lovett J. R.; Warren N. J.; Ratcliffe L. P. D.; Kocik M. K.; Armes S. P. pH-responsive non-ionic diblock copolymers: Ionization of carboxylic acid end-groups induces an order–order morphological transition. Angew. Chem., Int. Ed. 2015, 54 (4), 1279–1283. 10.1002/anie.201409799. [DOI] [PMC free article] [PubMed] [Google Scholar]
  48. Doncom K. E. B.; Warren N. J.; Armes S. P. Polysulfobetaine-based diblock copolymer nano-objects via polymerization-induced self-assembly. Polym. Chem. 2015, 6 (41), 7264–7273. 10.1039/C5PY00396B. [DOI] [Google Scholar]
  49. Mable C. J.; Gibson R. R.; Prevost S.; McKenzie B. E.; Mykhaylyk O. O.; Armes S. P. Loading of silica nanoparticles in block copolymer vesicles during polymerization-induced self-assembly: Encapsulation efficiency and thermally triggered release. J. Am. Chem. Soc. 2015, 137 (51), 16098–16108. 10.1021/jacs.5b10415. [DOI] [PMC free article] [PubMed] [Google Scholar]
  50. Warren N. J.; Mykhaylyk O. O.; Ryan A. J.; Williams M.; Doussineau T.; Dugourd P.; Antoine R.; Portale G.; Armes S. P. Testing the vesicular morphology to destruction: Birth and death of diblock copolymer vesicles prepared via polymerization-induced self-assembly. J. Am. Chem. Soc. 2015, 137 (5), 1929–1937. 10.1021/ja511423m. [DOI] [PMC free article] [PubMed] [Google Scholar]
  51. Mitchell D. E.; Lovett J. R.; Armes S. P.; Gibson M. I. Combining biomimetic block copolymer worms with an ice-inhibiting polymer for the solvent-free cryopreservation of red blood cells. Angew. Chem. 2016, 128 (8), 2851–2854. 10.1002/ange.201511454. [DOI] [PMC free article] [PubMed] [Google Scholar]
  52. Williams M.; Penfold N. J. W.; Armes S. P. Cationic and reactive primary amine-stabilised nanoparticles via RAFT aqueous dispersion polymerisation. Polym. Chem. 2016, 7 (2), 384–393. 10.1039/C5PY01577D. [DOI] [Google Scholar]
  53. Penfold N. J. W.; Lovett J. R.; Warren N. J.; Verstraete P.; Smets J.; Armes S. P. pH-responsive non-ionic diblock copolymers: protonation of a morpholine end-group induces an order–order transition. Polym. Chem. 2016, 7 (1), 79–88. 10.1039/C5PY01510C. [DOI] [Google Scholar]
  54. Ratcliffe L. P. D.; Couchon C.; Armes S. P.; Paulusse J. M. J. Inducing an order-order morphological transition via chemical degradation of amphiphilic diblock copolymer nano-objects. Biomacromolecules 2016, 17 (6), 2277–2283. 10.1021/acs.biomac.6b00540. [DOI] [PMC free article] [PubMed] [Google Scholar]
  55. Lovett J. R.; Ratcliffe L. P. D.; Warren N. J.; Armes S. P.; Smallridge M. J.; Cracknell R. B.; Saunders B. R. A robust cross-linking strategy for block copolymer worms prepared via polymerization-induced self-assembly. Macromolecules 2016, 49 (8), 2928–2941. 10.1021/acs.macromol.6b00422. [DOI] [PMC free article] [PubMed] [Google Scholar]
  56. Lovett J. R.; Warren N. J.; Armes S. P.; Smallridge M. J.; Cracknell R. B. Order-order morphological transitions for dual stimulus responsive diblock copolymer vesicles. Macromolecules 2016, 49 (3), 1016–1025. 10.1021/acs.macromol.5b02470. [DOI] [PMC free article] [PubMed] [Google Scholar]
  57. Akpinar B.; Fielding L. A.; Cunningham V. J.; Ning Y.; Mykhaylyk O. O.; Fowler P. W.; Armes S. P. Determining the effective density and stabilizer layer thickness of sterically stabilized nanoparticles. Macromolecules 2016, 49 (14), 5160–5171. 10.1021/acs.macromol.6b00987. [DOI] [PMC free article] [PubMed] [Google Scholar]
  58. Williams M.; Penfold N. J. W.; Lovett J. R.; Warren N. J.; Douglas C. W. I.; Doroshenko N.; Verstraete P.; Smets J.; Armes S. P. Bespoke cationic nano-objects via RAFT aqueous dispersion polymerisation. Polym. Chem. 2016, 7 (23), 3864–3873. 10.1039/C6PY00696E. [DOI] [Google Scholar]
  59. Penfold N. J. W.; Lovett J. R.; Verstraete P.; Smets J.; Armes S. P. Stimulus-responsive non-ionic diblock copolymers: protonation of a tertiary amine end-group induces vesicle-to-worm or vesicle-to-sphere transitions. Polym. Chem. 2017, 8 (1), 272–282. 10.1039/C6PY01076H. [DOI] [Google Scholar]
  60. Cockram A. A.; Neal T. J.; Derry M. J.; Mykhaylyk O. O.; Williams N. S. J.; Murray M. W.; Emmett S. N.; Armes S. P. Effect of monomer solubility on the evolution of copolymer morphology during polymerization-induced self-assembly in aqueous solution. Macromolecules 2017, 50 (3), 796–802. 10.1021/acs.macromol.6b02309. [DOI] [PMC free article] [PubMed] [Google Scholar]
  61. Thompson K. L.; Cinotti N.; Jones E. R.; Mable C. J.; Fowler P. W.; Armes S. P. Bespoke diblock copolymer nanoparticles enable the production of relatively stable oil-in-water Pickering nanoemulsions. Langmuir 2017, 33 (44), 12616–12623. 10.1021/acs.langmuir.7b02267. [DOI] [PMC free article] [PubMed] [Google Scholar]
  62. Byard S. J.; Williams M.; McKenzie B. E.; Blanazs A.; Armes S. P. Preparation and cross-linking of all-acrylamide diblock copolymer nano-objects via polymerization-induced self-assembly in aqueous solution. Macromolecules 2017, 50 (4), 1482–1493. 10.1021/acs.macromol.6b02643. [DOI] [PMC free article] [PubMed] [Google Scholar]
  63. Hatton F. L.; Lovett J. R.; Armes S. P. Synthesis of well-defined epoxy-functional spherical nanoparticles by raft aqueous emulsion polymerization. Polym. Chem. 2017, 8 (33), 4856–4868. 10.1039/C7PY01107E. [DOI] [Google Scholar]
  64. Canning S. L.; Cunningham V. J.; Ratcliffe L. P. D.; Armes S. P. Phenyl acrylate is a versatile monomer for the synthesis of acrylic diblock copolymer nano-objects via polymerization-induced self-assembly. Polym. Chem. 2017, 8 (33), 4811–4821. 10.1039/C7PY01161J. [DOI] [Google Scholar]
  65. Hunter S. J.; Thompson K. L.; Lovett J. R.; Hatton F. L.; Derry M. J.; Lindsay C.; Taylor P.; Armes S. P. Synthesis, characterization, and pickering emulsifier performance of anisotropic cross-linked block copolymer worms: effect of aspect ratio on emulsion stability in the presence of surfactant. Langmuir 2019, 35 (1), 254–265. 10.1021/acs.langmuir.8b03727. [DOI] [PubMed] [Google Scholar]
  66. Warren N. J.; Derry M. J.; Mykhaylyk O. O.; Lovett J. R.; Ratcliffe L. P. D.; Ladmiral V.; Blanazs A.; Fielding L. A.; Armes S. P. Critical dependence of molecular weight on thermoresponsive behavior of diblock copolymer worm gels in aqueous solution. Macromolecules 2018, 51 (21), 8357–8371. 10.1021/acs.macromol.8b01617. [DOI] [PMC free article] [PubMed] [Google Scholar]
  67. Byard S. J.; Blanazs A.; Miller J. F.; Armes S. P. Cationic Sterically Stabilized Diblock Copolymer Nanoparticles Exhibit Exceptional Tolerance toward Added Salt. Langmuir 2019, 35 (44), 14348–14357. 10.1021/acs.langmuir.9b02789. [DOI] [PubMed] [Google Scholar]
  68. Penfold N. J. W.; Whatley J. R.; Armes S. P. Thermoreversible block copolymer worm gels using binary mixtures of PEG stabilizer blocks. Macromolecules 2019, 52 (4), 1653–1662. 10.1021/acs.macromol.8b02491. [DOI] [Google Scholar]
  69. Brotherton E. E.; Hatton F. L.; Cockram A. A.; Derry M. J.; Czajka A.; Cornel E. J.; Topham P. D.; Mykhaylyk O. O.; Armes S. P. In situ small-angle X-ray scattering studies during reversible addition-fragmentation chain transfer aqueous emulsion polymerization. J. Am. Chem. Soc. 2019, 141 (34), 13664–13675. 10.1021/jacs.9b06788. [DOI] [PMC free article] [PubMed] [Google Scholar]
  70. Hatton F. L.; Park A. M.; Zhang Y.; Fuchs G. D.; Ober C. K.; Armes S. P. Aqueous one-pot synthesis of epoxy-functional diblock copolymer worms from a single monomer: new anisotropic scaffolds for potential charge storage applications. Polym. Chem. 2019, 10 (2), 194–200. 10.1039/C8PY01427B. [DOI] [Google Scholar]
  71. Gibson R. R.; Armes S. P.; Musa O. M.; Fernyhough A. End-group ionisation enables the use of poly (N-(2-methacryloyloxy) ethyl pyrrolidone) as an electrosteric stabiliser block for polymerisation-induced self-assembly in aqueous media. Polym. Chem. 2019, 10 (11), 1312–1323. 10.1039/C8PY01619D. [DOI] [Google Scholar]
  72. Ratcliffe L. P. D.; Derry M. J.; Ianiro A.; Tuinier R.; Armes S. P. A single thermoresponsive diblock copolymer can form spheres, worms or vesicles in aqueous solution. Angew. Chem. 2019, 131 (52), 19140–19146. 10.1002/ange.201909124. [DOI] [PMC free article] [PubMed] [Google Scholar]
  73. Blackman L. D.; Doncom K. E. B.; Gibson M. I.; O'Reilly R. K. Comparison of photo- and thermally initiated polymerization-induced self-assembly: a lack of end group fidelity drives the formation of higher order morphologies. Polym. Chem. 2017, 8, 2860–2871. 10.1039/C7PY00407A. [DOI] [PMC free article] [PubMed] [Google Scholar]
  74. Langer M. F.; Goeßmann A.; Rupp M. Representations of molecules and materials for interpolation of quantum-mechanical simulations via machine learning. npj Computat. Mater. 2022, 8 (1), 1–14. 10.1038/s41524-022-00721-x. [DOI] [Google Scholar]
  75. Ghiringhelli L. M.; Vybiral J.; Levchenko S. V.; Draxl C.; Scheffler M. Big data of materials science: critical role of the descriptor. Phys. Rev. Lett. 2015, 114 (10), 105503. 10.1103/PhysRevLett.114.105503. [DOI] [PubMed] [Google Scholar]
  76. Hüllermeier E.; Waegeman W. Aleatoric and epistemic uncertainty in machine learning: An introduction to concepts and methods. Mach. Learn. 2021, 110 (3), 457–506. 10.1007/s10994-021-05946-3. [DOI] [Google Scholar]
  77. Zhang M.-L.; Zhou Z.-H. A review on multi-label learning algorithms. IEEE Trans. Knowl. Data Eng. 2014, 26 (8), 1819–1837. 10.1109/TKDE.2013.39. [DOI] [Google Scholar]
  78. Dembczyński K.; Waegeman W.; Cheng W.; Hüllermeier E. On label dependence and loss minimization in multi-label classification. Mach. Learn. 2012, 88 (1–2), 5–45. 10.1007/s10994-012-5285-8. [DOI] [Google Scholar]
  79. Chen T.; Guestrin C. Xgboost: A scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 2016, 785–794. 10.1145/2939672.2939785. [DOI] [Google Scholar]
  80. Breiman L. Random forests. Machine Learning 2001, 45 (1), 5–32. 10.1023/A:1010933404324. [DOI] [Google Scholar]
  81. Friedman J. H.; Popescu B. E. Predictive learning via rule ensembles. Ann. Appl. Stat. 2008, 2 (3), 916–954. 10.1214/07-AOAS148. [DOI] [Google Scholar]
  82. Jiang X.; Zhao Y.; Liu J.; Jia B.; Qu X.; Qin M. Machine learning captures synthetic intuitions for hollow nanostructures. ACS Appl. Nano Mater. 2022, 5 (11), 17095–17104. 10.1021/acsanm.2c04016. [DOI] [Google Scholar]
  83. Murdoch W. J.; Singh C.; Kumbier K.; Abbasi-Asl R.; Yu B. Definitions, methods, and applications in interpretable machine learning. Proc. Natl. Acad. Sci. U. S. A. 2019, 116 (44), 22071–22080. 10.1073/pnas.1900654116. [DOI] [PMC free article] [PubMed] [Google Scholar]
  84. Hastie T.; Tibshirani R.; Friedman J. H. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed.; Springer: New York, NY, 2009. [Google Scholar]
  85. Behler J. First principles neural network potentials for reactive simulations of large molecular and condensed systems. Angew. Chem., Int. Ed 2017, 56 (42), 12828–12840. 10.1002/anie.201703114. [DOI] [PubMed] [Google Scholar]
  86. Zhang L.; Han J.; Wang H.; Car R.; E W. Deep potential molecular dynamics: a scalable model with the accuracy of quantum mechanics. Phys. Rev. Lett. 2018, 120 (14), 143001. 10.1103/PhysRevLett.120.143001. [DOI] [PubMed] [Google Scholar]
  87. Hase F; Roch L. M; Kreisbeck C.; Aspuru-Guzik A. Phoenics: a bayesian optimizer for chemistry. ACS Cent. Sci. 2018, 4 (9), 1134–1145. 10.1021/acscentsci.8b00307. [DOI] [PMC free article] [PubMed] [Google Scholar]
  88. Yap C. W. PaDEL-descriptor: An open source software to calculate molecular descriptors and fingerprints. J. Comput. Chem. 2011, 32 (7), 1466–1474. 10.1002/jcc.21707. [DOI] [PubMed] [Google Scholar]
  89. Zhu C.; Byrd R. H.; Lu P.; Nocedal J. Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization. ACM Trans. Math. Softw. 1997, 23 (4), 550–560. 10.1145/279232.279236. [DOI] [Google Scholar]
  90. Lou Y.; Caruana R.; Gehrke J.; Hooker G. Accurate intelligible models with pairwise interactions. Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 2013, 623–631. 10.1145/2487575.2487579. [DOI] [Google Scholar]
  91. Tibshirani R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B Methodol. 1996, 58 (1), 267–288. 10.1111/j.2517-6161.1996.tb02080.x. [DOI] [Google Scholar]
