Author manuscript; available in PMC: 2025 Oct 22.
Published in final edited form as: ACS Nano. 2024 Oct 7;18(42):28735–28747. doi: 10.1021/acsnano.4c07615

Machine Learning Elucidates Design Features of Plasmid DNA Lipid Nanoparticles for Cell Type-Preferential Transfection

Leonardo Cheng 1,2,3, Yining Zhu 1,2,3, Jingyao Ma 1,3,4, Ataes Aggarwal 1,2,5, Wu Han Toh 1,3,5,6, Charles Shin 1,2, Will Sangpachatanaruk 1,5,7, Gene Weng 1,3,8, Ramya Kumar 9, Hai-Quan Mao 1,2,3,4
PMCID: PMC11512640  NIHMSID: NIHMS2029282  PMID: 39375194

Abstract

To broaden the accessibility of cell and gene therapies, it is essential to develop and optimize non-viral, cell type-preferential gene carriers such as lipid nanoparticles (LNPs). While high-throughput screening (HTS) approaches have proven effective in accelerating LNP discovery, they are often costly, labor-intensive, and do not consistently yield actionable design rules that direct screening efforts toward the most relevant chemical and formulation parameters. In this study, we employed a machine learning (ML) workflow, utilizing well-curated plasmid DNA LNP transfection datasets across six cell types, to extract chemical insights from HTS studies. Our approach achieved prediction errors averaging between 5 and 10%, depending on the cell type. By applying SHapley Additive exPlanations (SHAP) to our ML models, we uncovered key composition-function relationships that govern cell type-preferential LNP transfection efficiency. Notably, we identified consistent LNP composition parameters that enhance in vitro transfection efficiency across diverse cell types, including a helper lipid molar percentage of charged lipids between 9 and 50% and the inclusion of cationic/zwitterionic helper lipids. Additionally, several parameters were found to modulate cell type-preferentiality, such as the total molar percentage of ionizable and helper lipids, N/P ratio, PEGylated lipid molar percentage of uncharged lipids, and the hydrophobicity of the helper lipid. This study leverages HTS of compositionally diverse LNP libraries combined with ML analysis to elucidate the interactions between lipid components in LNP formulations, providing insights that contribute to the design of LNP compositions tailored for cell type-preferential transfection.

Keywords: lipid nanoparticle, high throughput screening, machine learning, SHapley Additive exPlanation, composition-function relationship, cell type-preferential transfection


INTRODUCTION

Gene and cell therapies have increasingly become a promising approach to combat a variety of diseases from genetic disorders to cancer1. Although these therapeutics have proven efficacious across a variety of applications, reliable and targeted delivery of genetic payloads to specified in vivo and in vitro cell and tissue targets remains elusive2. Most FDA-approved gene and cell therapies rely on viral gene delivery, which is limited by payload packaging capacity, compromised therapeutic outcomes due to pre-existing antibodies, and patient immune response, resulting in efficacy and safety concerns3–5. These challenges continue to provide impetus for the development of non-viral carriers such as lipid or polymeric nanoparticles as alternative solutions for gene therapies. Lipid nanoparticles (LNPs) have demonstrated remarkable potential as an effective carrier in nucleic acid delivery in vitro and in vivo6–8. The approvals of siRNA LNPs (Onpattro) and two mRNA LNP-based SARS-CoV-2 vaccines, Spikevax® (Moderna) and Comirnaty® (BioNTech/Pfizer), have solidified LNPs as a leading non-viral carrier for gene therapeutics9. Despite success in gene vaccine delivery, LNPs are not a “one formulation fits all” delivery modality, and the design space of LNP formulations must be expanded and refined to address factors such as cell type-preferential transfection, tissue/organ targeting, and route of administration10.

LNPs are classically composed of four lipid components: ionizable cationic lipids, helper lipids, cholesterol, and PEGylated lipids10,11. Ionizable cationic lipids, which typically incorporate tertiary amines with pH-sensitive protonation equilibria, primarily package anionic payloads (nucleic acids) and facilitate endosomal escape. Several studies have highlighted the importance of incorporating additional lipids such as helper lipids to promote cellular association and targeting, cholesterol for fusogenicity, structural integrity, and stability, and PEGylated lipids to modulate plasma protein interactions, LNP colloidal stability, and LNP biodistribution and pharmacokinetics in vivo10,12–14. Nevertheless, while the roles of ionizable lipids have been extensively mapped, the roles of other lipid components, particularly helper lipids, and the combined interactions of lipid components with each other remain poorly understood, thus limiting our ability to rationally design and screen LNP compositions. LNP optimization is notoriously inefficient because LNP composition parameters span a vast and relatively unexplored design space. Unless LNP optimization is guided and informed by fundamental insights into the roles of lipid chemistry and formulation parameters, LNP design will largely remain subject to trial-and-error experimentation. Therefore, prudent strategies to probe and characterize this vast design space are necessary.

High throughput screening (HTS) studies have become an increasingly routine approach to optimize new LNP formulations. The majority of LNP HTS studies have focused on synthesizing and screening new ionizable lipids to achieve optimal LNP formulations. However, they offer very limited chemical insights into the roles of helper lipid chemical identity and overall formulation compositions15–18. Our recent study on HTS of LNPs highlighted that designing effective LNP formulations requires extensive optimization of composition and careful selection of compatible helper lipid chemistries19. Specifically, our HTS identified top LNP compositions from a 1080-formulation library (varying helper lipid charge and lipid component ratios) that effectively transfected hepatocarcinoma cells in vitro and, more importantly, achieved successful delivery to the liver after intrahepatic or intravenous injection. Another recent study showed that cell-specific transfection can be achieved by modulating formulation composition and helper lipid chemical identity without altering the ionizable lipid structure20.

Machine learning (ML) algorithms combined with downstream analysis techniques are promising tools for exploring and understanding the interplay between multiple design parameters governing LNP transfection efficiency. Developing and validating ML workflows that are robust to experimental noise can circumvent the reliance on exhaustive and tedious large-scale screening. Given sufficiently large, well-curated training data, ML algorithms can identify complex relationships between input parameters (LNP features) and output responses (transfection performance). Several recent reports showcased the ability of ML approaches to predict how carrier attributes (e.g., size, polydispersity index, and ζ-potential) influence biological responses such as transfection efficiency and cell viability of LNPs and polyplexes21–25. ML has also assisted the development of new ionizable lipid chemistries, but the role of helper lipid chemistry and lipid component ratios has not been extensively explored within an ML framework22,24.

Several common ML algorithms have been utilized in the analysis of non-viral gene carrier systems. Most are considered complex “black box models”, which renders interpretation and elucidation of structure-function (transfection efficiency) relationships challenging. SHapley Additive exPlanations (SHAP), an analysis framework grounded in game theory, has been shown to robustly interpret findings from ML analysis by providing a unified measure of feature importance26. Here, we used a transfection dataset from an LNP library comprising 1,080 LNP formulations generated by varying LNP compositions and helper lipid chemistry. This compositionally diverse LNP library was tested in 6 cell types, and its in vitro transfection performance was assessed19,20. We developed a computational pipeline that selects, trains, and refines high performing ML algorithms that predict transfection outcomes as a function of LNP design features in each cell type. We then conducted comparative SHAP analysis of feature importance to elucidate structure-function relationships governing cell type-preferential transfection. The integration of experimenter-directed HTS with a robust ML pipeline presents a new approach to confidently decipher the structure-function relationship of LNP formulations.

RESULTS AND DISCUSSION

Helper lipid and LNP composition modulate cell type-preferential transfection efficiency.

The dataset used to train the ML model was produced using an HTS approach for a nucleic acid-loaded LNP library that we generated and characterized previously19,20,27 (Figure 1a) in B16F10 (murine melanoma), HepG2 (human hepatocellular carcinoma), HEK293 (human embryonic kidney), PC3 (human prostatic adenocarcinoma), ARPE-19 (human retinal pigment epithelia), and N2a (murine neuroblast) cell lines. Each formulation contains DLin-MC3-DMA as the ionizable lipid, cholesterol, DMG-PEG2000 as the PEGylated lipid, and one helper lipid selected from a group of 6 carrying different charge types: cationic head groups (DOTAP, DDAB), zwitterionic head groups (DOPE, DSPC), and anionic head groups (18PG, 14PA) based on our previous screening studies19. The following four independent composition parameters were adjusted: (i) total molar percentage of charged lipids, i.e., ionizable lipid (IL) and helper lipid (HL), ranging from 20–80% (“(IL+HL)%”), (ii) helper lipid molar percentage of charged lipids ranging from 0.5–50% (“HL/(IL+HL)%”), (iii) PEGylated lipid molar percentage of uncharged lipids, i.e., cholesterol (Chol) and PEGylated lipid (PEG), ranging from 0.2–9% (“PEG/(Chol+PEG)%”), and (iv) protonatable nitrogen to phosphate ratio (“N/P ratio”) ranging from 4 to 12 (Supplementary Table 1 and Supplementary Figure 1a). By varying these composition parameters, 180 unique LNP formulations were generated for each helper lipid. Initially, seven molecular descriptors were generated based on the helper lipid chemical structure using the most distinctive chemical structural features to parameterize the structure of the helper lipids (Supplementary Table 2 and Supplementary Figure 1b), and prioritized based on the chemical interpretability of the results28.
For example, the number of positively/negatively charged centers and the number of hydrogen bond acceptors/donors quantify the capacity of helper lipids to form electrostatic interactions and hydrogen bonds within a pDNA LNP. Other features such as the number of double bonds, the total number of carbon atoms within the lipid tails, and calculated LogP (cLogP) may characterize the structural rigidity and hydrophobicity of the helper lipid11,29. In addition to helper lipid chemical features, particle-level physicochemical properties such as LNP size, polydispersity, and ζ-potential were included (Supplementary Figure 1c).
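The three ratio-based composition parameters defined above fix the molar percentage of each of the four lipids, because the charged fraction (IL + HL) and the uncharged fraction (Chol + PEG) together make up 100 mol%. The following is a minimal sketch of that arithmetic; the function name and the example values are illustrative assumptions, not taken from the study. The N/P ratio is left out because it sets the amount of pDNA relative to ionizable lipid rather than the lipid molar percentages.

```python
def lnp_mol_percent(charged_pct, hl_of_charged_pct, peg_of_uncharged_pct):
    """Convert the three ratio parameters into mol% of each lipid component.

    charged_pct           (IL+HL)%        : mol% of all lipids that are IL or HL
    hl_of_charged_pct     HL/(IL+HL)%     : share of the charged lipids that is HL
    peg_of_uncharged_pct  PEG/(Chol+PEG)% : share of the uncharged lipids that is PEG
    """
    uncharged_pct = 100.0 - charged_pct
    hl = charged_pct * hl_of_charged_pct / 100.0
    il = charged_pct - hl
    peg = uncharged_pct * peg_of_uncharged_pct / 100.0
    chol = uncharged_pct - peg
    return {"IL": il, "HL": hl, "Chol": chol, "PEG": peg}


# Illustrative formulation: (IL+HL)% = 40, HL/(IL+HL)% = 50, PEG/(Chol+PEG)% = 1
example = lnp_mol_percent(40.0, 50.0, 1.0)
for lipid, pct in example.items():
    print(f"{lipid}: {pct:.2f} mol%")
```

For these inputs the ionizable and helper lipids each account for 20 mol%, cholesterol for 59.4 mol%, and the PEGylated lipid for 0.6 mol%.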

Figure 1.

Schematic of high throughput screening and ML pipeline. (a) Data generation using an LNP library and screening platform tested in 6 different cell types. (b) Model selection and hyperparameter optimization using a nested cross validation approach and model input feature refinement. SHapley Additive exPlanations (SHAP) was used to analyze optimized models and elucidate design rules for cell type-preferential LNP formulations.

Transfection efficiency measured by the relative luminescence units (RLU) obtained with this LNP library was sufficiently broad (e^1.5 to e^12 RLU/well) for most cell types. Broadly, the transfection efficiency was sensitive to the cell type, the chemical identity of the helper lipid, and LNP composition (Figure 2a). We observed non-overlapping transfection efficiency trends in each cell type, indicated by the low (r < 0.35) or moderate rank correlation (0.36 < r < 0.67)30 of formulation-dependence in transfection efficiency between each pair of cell types (Figure 2b). A low Spearman’s rank correlation coefficient indicates that LNP features responsible for transfection efficiency diverge between cell types, whereas a high Spearman’s rank coefficient suggests overlapping LNP design criteria shared by the two cell types. For the most part, LNPs that performed well in one cell type typically underperformed in others. The two exceptions were N2a and ARPE-19 cells, which were the most highly correlated cell lines with a Spearman’s rank correlation coefficient of 0.71. We attributed the high correlation to the shared neuronal origins of ARPE-19 and N2a cells31,32. Our results suggest that it may be possible to exploit interactions between LNP design features and certain cellular features of the target cell type to develop tailored LNPs for cell type-preferential transfection through modulation of the formulation compositions and helper lipid identity. Further analysis of the Spearman’s rank coefficients between subsets of formulations containing the same helper lipid revealed that specific formulation composition parameters were highly conserved when using DDAB, moderately conserved when using DOTAP and DOPE, and weakly conserved when using DSPC, 18PG, and 14PA. Our analysis establishes that the choice of helper lipid impacts the extent to which LNPs selectively mediate efficient pDNA delivery in one cell type but not in others (Supplementary Figure 2).
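As an illustration of the pairwise comparison described above, Spearman’s rank correlation can be computed by ranking the formulations within each cell type and taking the Pearson correlation of the two rank vectors. The sketch below uses only the Python standard library; the five readouts per cell type are hypothetical values, not data from the screen.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical log-scale readouts for the same five formulations in two cell types.
cell_a = [2.1, 5.4, 9.8, 7.0, 3.3]
cell_b = [1.9, 6.5, 8.7, 8.9, 2.0]
rho = spearman(cell_a, cell_b)
print(f"Spearman rho = {rho:.2f}")
```

In this toy example only the two best formulations swap places between the cell types, so the coefficient stays high (0.9). A low value would mean that the ranking of formulations in one cell type says little about the ranking in the other.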

Figure 2.

Preliminary comparison of screening results across the cell lines. (a) Distribution of transfection efficiency of each cell type colored by helper lipid used in formulation. Gray dotted line indicates the value used to floor the dataset during preprocessing. (b) Heatmap of Spearman’s rank correlation coefficient of LNP formulation transfection between cell types after data preprocessing.

Decision tree-based algorithms describe complex non-linear relationships between LNP formulation features and transfection efficiency.

The data set was preprocessed to facilitate comparative analysis between cell type-preferential models (see Methods for details). Importantly, the transfection efficiency values within a specific cell type dataset were normalized between a range of 0 (min) to 1 (max) based on the maximum transfection achieved to eliminate the impact of overall cell type “transfectability” from downstream analysis.
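A minimal sketch of this kind of per-cell-type scaling is shown below. The exact preprocessing is described in the Methods, which are not reproduced here, so this sketch assumes a simple floor followed by min-max scaling; the floor value and the readouts are hypothetical.

```python
def floor_and_normalize(values, floor_value):
    """Floor low readouts, then rescale one cell type's data to the range [0, 1]."""
    floored = [max(v, floor_value) for v in values]
    lo, hi = min(floored), max(floored)
    if hi == lo:
        # Degenerate case: every formulation has the same readout.
        return [0.0 for _ in floored]
    return [(v - lo) / (hi - lo) for v in floored]


# Hypothetical log-scale readouts for one cell type, floored at 2.0.
normalized = floor_and_normalize([1.0, 2.0, 7.0, 12.0], floor_value=2.0)
print(normalized)
```

Because each cell type is scaled against its own minimum and maximum, a value of 1 always denotes the best formulation for that cell type, regardless of how easily the cell line is transfected overall.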

Various ML algorithms are available to capture trends in datasets; however, certain algorithms may perform better than others depending on the complexity and type of relationship between the input and output parameters. Furthermore, while a model may be able to fit a relationship to training data precisely, researchers must be wary of the harmful impacts of over/underfitting and lack of training data on model robustness. To address these concerns, model selection was conducted to select the best performing algorithms with minimal overfitting and to validate whether the training data was sufficiently large. A nested cross-validation approach was used to compare performance of models from a panel of regression analyses, nearest-neighbor algorithms, neural networks, and decision tree-based algorithms33. We split the dataset using an 85:15 train-test approach; this split was performed only once (using random selection), and the 15% hold-out validation dataset was insulated from hyperparameter optimization and used solely for scoring model performance. Next, the remaining 85% training data was further split into two sets comprising 68% and 17% of the total data. To do this, we fed the training data into a 5-fold nested cross-validation (outer loop), which randomly selected 80% of input training data (or 68% of the total data) for model development and hyperparameter optimization (inner loop). The remaining 20% of the training dataset (or 17% of the total data) was used as test data (Figure 1b). In summary, we created a 68:17:15 inner loop: outer loop: hold-out validation split of all data, allowing the validation split to be completely insulated from the nested cross-validation procedure. Within the inner loop (68% of the total data), we applied another 4-fold cross-validation split and conducted 100 iterations of hyperparameter optimization using a random grid search (Supplementary Table 4).
The best model hyperparameters were selected by minimizing the mean absolute error (MAE) between the inner loop test set and model predictions. After hyperparameter optimization, the model was evaluated on the outer loop test set (17%) to determine the outer loop test error. Our nested cross-validation procedure identifies 5 sets of optimized hyperparameters for each model with associated inner loop and outer loop test errors. From these sets of optimized hyperparameters, the final set of hyperparameters was chosen to reduce model overfitting by minimizing the difference between the inner loop and outer loop test errors. Finally, the model fit on the final set of hyperparameters was then evaluated on the hold-out validation dataset (15% of total data insulated from model training and testing in the outer and inner loops) to compute the validation error. Through this approach, we measured the accuracy of our predictive pipeline on validation data sets that were insulated from model training and testing. Our robust model selection approach integrated rigorous hyperparameter optimization and balanced the validation and test error, thus ensuring that high performing and minimally overfitting model architectures were chosen.
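The selection procedure described above can be sketched in a few dozen lines. The version below is a simplified illustration using only the Python standard library, not the pipeline used in the study: a toy k-nearest-neighbor regressor on synthetic one-dimensional data stands in for the real models, the only hyperparameter is k, and 10 random-search iterations are used instead of 100 to keep the run short. All function names are hypothetical.

```python
import math
import random


def mae(y_true, y_pred):
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)


def knn_predict(train, x, k):
    """Toy stand-in model: mean target of the k nearest training points."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(p[1] for p in nearest) / len(nearest)


def score(train, test, k):
    """MAE on `test` of a k-NN model built from `train`."""
    return mae([y for _, y in test], [knn_predict(train, x, k) for x, _ in test])


def k_folds(items, n_folds):
    """Yield (train, test) pairs; each item appears in exactly one test fold."""
    for f in range(n_folds):
        test = items[f::n_folds]
        train = [it for i, it in enumerate(items) if i % n_folds != f]
        yield train, test


def nested_cv(data, n_outer=5, n_inner=4, n_iter=10, seed=0):
    rng = random.Random(seed)
    data = list(data)
    rng.shuffle(data)
    n_holdout = round(0.15 * len(data))          # 15% hold-out, split only once
    holdout, development = data[:n_holdout], data[n_holdout:]

    candidates = []                              # one (error gap, k) per outer fold
    for outer_train, outer_test in k_folds(development, n_outer):
        best_k, best_inner = None, float("inf")
        for _ in range(n_iter):                  # random hyperparameter search
            k = rng.randint(1, 15)
            folds = k_folds(outer_train, n_inner)
            inner = sum(score(tr, te, k) for tr, te in folds) / n_inner
            if inner < best_inner:
                best_k, best_inner = k, inner
        outer = score(outer_train, outer_test, best_k)
        candidates.append((abs(best_inner - outer), best_k))

    _, final_k = min(candidates)                 # smallest inner/outer error gap
    holdout_mae = score(development, holdout, final_k)
    return final_k, holdout_mae, len(holdout), len(development)


# Synthetic data: a noisy sine curve sampled at 200 random points.
rng = random.Random(42)
toy = []
for _ in range(200):
    x = rng.uniform(0.0, 2.0 * math.pi)
    toy.append((x, math.sin(x) + rng.gauss(0.0, 0.05)))

final_k, holdout_mae, n_holdout, n_dev = nested_cv(toy)
print(f"k = {final_k}, hold-out MAE = {holdout_mae:.3f}, "
      f"hold-out n = {n_holdout}, development n = {n_dev}")
```

The hold-out set is carved off before any tuning and is touched only in the final scoring step. Choosing the hyperparameter set with the smallest gap between inner-loop and outer-loop error, rather than the lowest error alone, favors settings that generalize consistently. Applied to the 1,080-formulation library, the same 15% split gives 162 hold-out formulations, matching the n reported for Figure 3b.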

Across all cell types, the best performing models were obtained for the B16F10, HepG2, and PC3 cell types using the LightGBM (LGBM) model, which achieved a mean absolute error (MAE) below 0.06 (i.e., predicted transfection efficiency with below 6% error on average) when evaluated using the hold-out validation set (Supplementary Table 5). For the B16F10 cell type, model selection established that the LGBM model performed the best among all tested models (Figure 3a). The LGBM model achieved high Pearson’s (r = 0.94) and Spearman’s rank (r = 0.92) correlations on the hold-out validation set (Figure 3b). In vitro validation of novel formulation compositions absent from the training data supports the predictive ability of our model in unexplored feature spaces (Supplementary Figure 3, Supplementary Table 6). We also assessed our model’s ability to predict novel helper lipid formulations by conducting leave-one-lipid-out (HL-1) stratification of our dataset. We iteratively withheld formulations from a single helper lipid class from the training dataset, trained our models on the HL-1 dataset, and tested the model on the withheld candidate. While model MAE increased from around 5% to between 6–16% (depending on the withheld helper lipid), the generalization performance was better than expected given the small and restricted helper lipid library (Supplementary Figure 4). Interestingly, decision tree-based models (LGBM, XGB, RF) performed significantly better than many other tested models; however, there was little difference in performance among the decision tree-based models (Figure 3a). Unlike regression models, decision tree-based models fit data against ordered sets of yes/no decisions (“nodes”) to determine the predicted output.
As such, this method fits non-linear, complex relationships more accurately than other algorithms such as linear regression-based models, which are better suited for simpler, linear relationships described by explicit mathematical equations34. That the decision tree-based models most effectively analyzed the dataset suggests that the relationships between formulation parameters and transfection efficiency are non-linear and complex; otherwise, regression models would have performed equally well. Similar results have been reported in other comparable tasks where decision tree-based models outperform various other model architectures35–40. We achieved minimal error (MAE < 0.10), strong Pearson’s correlation (r > 0.8), and moderate to strong Spearman’s rank correlation (r > 0.7) between the predictions and hold-out experimental values across all cell types (Supplementary Table 4, Supplementary Figure 5). We also observed similar trends where decision tree-based models performed significantly better than regression models; more specifically, we found that LGBM models performed the best for all cell types (Supplementary Figure 6). Given the difficulty of modeling non-linear complex relationships between the formulation features and transfection across all cell lines, our results establish that decision tree-based ML models most reliably capture and model critical features underlying LNP transfection performance.

Figure 3.

Model selection and hyperparameter optimization for B16F10 cell type. (a) Boxplot of hold-out set error centered around the MAE (red triangles); dots represent individual hold-out set absolute errors. Models are ranked from lowest (left) to highest (right) MAE. (b) Hold-out set experimental ground-truth normalized transfection vs. optimized LGBM model-predicted normalized transfection of the hold-out validation set (n = 162). Model predictions mirrored experimental results. (c) Training and validation error dependent on training set size of the selected LGBM model for B16F10. Convergence and plateauing of training and validation curves around 700 formulations indicate that our training dataset (1,080 formulations) exceeded the minimum necessary for robust model training. (d) Input features are removed from the training data until the removal of more features would lead to worse model performance. B16F10 LGBM model requires only 8 out of 14 features for optimal performance. (e) Comparison of refined model performance (MAE) across cell types. (f) Average MAE over 20 trials of randomized 5-fold test-train splits of straw models tested by shuffling input feature values. Shuffling of polar intermolecular interaction features (hydrogen bond donors/acceptors and positively charged centers), apolar features associated with lipid hydrophobicity/rigidity (cLogP and double bonds in tails), and all helper lipid features (both polar and apolar) resulted in a significant decrease in model performance. Statistical analysis was conducted using one-way ANOVA with Tukey’s and Dunnett’s multiple comparisons tests for (a) and (f), respectively. *p < 0.05, **p < 0.01, ***p < 0.001, ****p < 0.0001.

We attribute our high model accuracies to our careful model selection process and the integration of HTS data generation ensuring the availability of reliable and well-curated datasets that focused primarily on formulation composition parameters. To validate whether the training dataset size was sufficiently large for model training, the best performing model for each cell type was retrained with a varying number of randomly selected datapoints, and the validation loss (MAE when predicting hold-out data) and the training loss (MAE when predicting training data) were recorded. In B16F10 cells, the validation loss approached the training loss; and an overall plateau of the loss curves was observed, indicating that the size of the training dataset comfortably exceeded the minimum needed for model training and validation, and that minimal overfitting was observed (Figure 3c). Similar trends were observed in the other cell types (Supplementary Figure 7). These results highlight the benefit of integrating HTS data generation in our computational analysis. Our approach also provides a major improvement for downstream mechanistic analysis compared to many other studies within the LNP field21,22,24. We not only ensured that the training data gathered was sufficient to construct robust and reliable models, but also minimized lab-to-lab variation and experimental reproducibility issues because all data is gathered under the same test conditions and with the same protocols. The consistency and comparability of our dataset, combined with our focus on data diversity (ensuring that low, moderate, and high performing formulations were well-represented) sets this study apart from previous work. 
Previously reported studies, in contrast, relied on training models with relatively small, non-curated datasets mined from historical data or literature, which often used inconsistent protocols and narrowly focused on “optimized” results, and therefore provided little understanding of why most LNP formulations fail22,41. Our approach instead leverages a larger and more reliable dataset that is equally inclusive of both optimal and non-optimal formulations, enabling us to train trustworthy models that capture trends governing LNP transfection performance.
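The training-set-size check described in this section (retraining on progressively larger random subsets while tracking training and validation MAE) can be sketched as follows. This is an illustrative standard-library example on synthetic data with a toy k-nearest-neighbor regressor standing in for the selected model; the function names, subset sizes, and data are assumptions.

```python
import math
import random


def knn_mae(train, test, k=5):
    """MAE of a toy k-nearest-neighbor regressor (stand-in for the real model)."""
    total = 0.0
    for x, y in test:
        nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
        total += abs(sum(p[1] for p in nearest) / len(nearest) - y)
    return total / len(test)


def learning_curve(train, holdout, sizes, seed=0):
    """Return (size, training MAE, validation MAE) for each training-set size."""
    rng = random.Random(seed)
    curve = []
    for n in sizes:
        subset = rng.sample(train, n)            # random subset of the training data
        curve.append((n, knn_mae(subset, subset), knn_mae(subset, holdout)))
    return curve


# Synthetic data: a noisy sine curve; 40 points held out, 160 used for training.
rng = random.Random(1)
points = []
for _ in range(200):
    x = rng.uniform(0.0, 2.0 * math.pi)
    points.append((x, math.sin(x) + rng.gauss(0.0, 0.02)))
holdout, train = points[:40], points[40:]

curve = learning_curve(train, holdout, sizes=[20, 40, 80, 160])
for n, train_mae, val_mae in curve:
    print(f"n = {n:3d}  training MAE = {train_mae:.3f}  validation MAE = {val_mae:.3f}")
```

When the validation curve flattens and approaches the training curve as the subset grows, adding more formulations is unlikely to improve the model much, which is the behavior reported for Figure 3c.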

Feature refinement establishes that chemical design features of lipid components are more predictive of transfection than nanoparticle-level design features.

We next conducted feature refinement aiming to remove unnecessary features and mitigate the risk of mistakenly deriving spurious relationships between certain features. By comparing the Ward linkage distances from unsupervised hierarchical clustering using the farthest neighbor algorithm (Supplementary Figure 8), highly clustered features were removed from the model inputs individually. If the model performance worsened, the feature was restored33. Feature refinement for the B16F10 cell type showed that the model only required 8 of the initial 14 features (Figure 3d). This process was repeated for each cell line individually. Impressively, our models fit using refined feature sets for B16F10, HepG2, PC3, and N2a achieved around 5–6% error on average, and models for all cell types achieved less than 10% error on average (Figure 3e, Table 2). Of the initial 14 features, models for B16F10 and HepG2 retained 8 and 9 features, respectively, while models for PC3, HEK293, N2a and ARPE-19 retained only 7 features (Table 2, Supplementary Figure 9). All four formulation composition features were retained for every cell type. Of the helper lipid chemical features, only the number of positively charged centers was retained across all cell types while cLogP was retained in all cell types except N2a. Other helper lipid features such as number of hydrogen bond acceptors/donors, carbons in the lipid tails, and double bonds in the lipid tails were necessary for certain cell type models to perform optimally, but different features were retained in each model (Table 2). The number of positively charged centers contributes to the overall charge of LNPs; and cationic/zwitterionic lipids were identified more frequently in high transfecting formulations in all cell types (Figure 2a). The number of hydrogen bond donors/acceptors characterizes potential hydrogen bond interactions between lipid components and the nucleic acid payload. 
Non-electrostatic helper lipid features helped the model differentiate between similarly charged helper lipids, particularly by characterizing the structure and chemical properties of the lipid tails rather than the headgroup, emphasizing the importance of holistically considering all chemical properties within LNP formulations. Notably, all nanoparticle-level features such as size, PDI, and ζ-potential were consistently removed, and the exclusion of these features almost always improved model performance (Supplementary Figure 9). It is possible that these physicochemical features introduced potentially spurious contributions to predictions in these cell types. This result may have been influenced by the biased distribution of LNP sizes (the majority within the 100–300 nm range) and ζ-potential values (the majority within −10 to +10 mV) (Supplementary Figure 1c). As such, there simply was not enough variability in these nanoparticle-level features for them to contribute to transfection efficiency trends. This conclusion should also be considered within the context of LNP uptake mechanism. Most of the cell types studied favor 100–300 nm particles that are neutral or slightly positively charged42. Since the majority of our LNP formulations fall within these size and ζ-potential ranges, the tested models do not need these nanoparticle-level features to make accurate predictions. Overall, the feature refinement process enabled the ML pipeline to retain the most critical input parameters and further improved performance, highlighting the importance of LNP composition and helper lipid choice. Additionally, this process will aid model interpretation and derivation of structure-function relationships during post-hoc SHAP analysis.
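The remove-and-restore logic of the feature refinement step can be sketched as a simple greedy loop. In the sketch below the order of candidate features is supplied by hand, whereas the study derives it from hierarchical clustering, and the scoring function is a made-up stand-in for retraining and evaluating a model; all names and numbers are hypothetical.

```python
def refine_features(features, candidates, evaluate):
    """Drop candidate features one at a time, restoring any whose removal hurts.

    evaluate(feature_list) must return an error, e.g. a cross-validated MAE.
    """
    kept = list(features)
    best_error = evaluate(kept)
    for feature in candidates:
        trial = [f for f in kept if f != feature]
        trial_error = evaluate(trial)
        if trial_error <= best_error:
            # Performance did not worsen, so the feature stays removed.
            kept, best_error = trial, trial_error
        # Otherwise `kept` is left unchanged, which restores the feature.
    return kept, best_error


# Hypothetical scoring rule: each missing informative feature costs 0.03 MAE,
# and each uninformative feature still present costs 0.002 MAE.
INFORMATIVE = {"N/P ratio", "cLogP"}


def toy_error(feature_list):
    missing = len(INFORMATIVE - set(feature_list))
    noisy = len(set(feature_list) - INFORMATIVE)
    return 0.05 + 0.03 * missing + 0.002 * noisy


kept, error = refine_features(
    ["N/P ratio", "cLogP", "size", "PDI"],
    candidates=["size", "PDI", "cLogP"],
    evaluate=toy_error,
)
print(kept, round(error, 3))
```

In this toy run the two particle-level features are dropped because removing them lowers the error, while cLogP is restored because removing it raises the error.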

Table 2.

Model refinement results using LGBM in different cell types

| Cell type | B16F10 | HepG2 | PC3 | HEK293 | N2a | ARPE-19 |
| --- | --- | --- | --- | --- | --- | --- |
| Number of features needed | 8 | 9 | 7 | 7 | 7 | 7 |
| Composition features retained | All | All | All | All | All | All |
| Helper lipid features retained (+): | | | | | | |
| positively charged centers | + | + | + | + | + | + |
| cLogP | + | + | + | + | - | + |
| hydrogen bond donors | - | + | - | - | - | + |
| hydrogen bond acceptors | - | + | + | - | + | - |
| carbons in tails | + | + | - | - | - | - |
| double bonds in tails | + | - | - | + | + | - |
| Physicochemical features retained | None | None | None | None | None | None |
| MAE average | 0.048 | 0.058 | 0.060 | 0.081 | 0.062 | 0.097 |
| Spearman average | 0.925 | 0.749 | 0.720 | 0.839 | 0.887 | 0.870 |
| Pearson average | 0.940 | 0.881 | 0.905 | 0.822 | 0.906 | 0.875 |

cLogP, calculated LogP; MAE, mean absolute error.

Further interrogating the refined models and the chemical design of the helper lipid, we conducted straw model experiments by randomly shuffling values of selected features and evaluating model performance. Specifically, we hypothesized that polar (hydrogen bond donors/acceptors, positively charged centers, and negatively charged centers) and apolar (cLogP, carbons in tails, double bonds in tails) intermolecular interactions are key descriptive features of the helper lipids. We observed significant loss in model performance in the B16F10 cells when trained on shuffled feature sets, and comparable results in all other cell types (Figure 3f, Supplementary Figure 10). This result confirms our hypothesis and further ensures that our models are truly reflective of underlying structure-function trends and are not arising from data artifacts. Interestingly, simultaneously shuffling both apolar and polar features dramatically decreases model performance. This decrease in model performance exceeded the cumulative performance loss observed when we separately shuffled the apolar and polar features. This implies that our models identified critical interactions between polar and apolar helper lipid features; these interactions will be investigated with a broader helper lipid library in future studies.
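The straw model idea can be illustrated with a small synthetic example: permute the values of chosen feature columns across formulations, retrain, and compare the error with the unshuffled baseline. The sketch below uses a toy k-nearest-neighbor regressor and made-up data in which only the first feature drives the response. It shuffles each selected column independently, which is an assumption about the procedure rather than a detail given in the text.

```python
import random


def knn_mae(X_train, y_train, X_test, y_test, k=5):
    """MAE of a toy k-nearest-neighbor regressor (stand-in for the real model)."""
    total = 0.0
    for x, y in zip(X_test, y_test):
        order = sorted(
            range(len(X_train)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(X_train[i], x)),
        )
        total += abs(sum(y_train[i] for i in order[:k]) / k - y)
    return total / len(y_test)


def shuffle_columns(X, cols, rng):
    """Copy X with each chosen feature column permuted across rows."""
    shuffled = [list(row) for row in X]
    for c in cols:
        values = [row[c] for row in shuffled]
        rng.shuffle(values)
        for row, v in zip(shuffled, values):
            row[c] = v
    return shuffled


def straw_model_mae(X, y, cols, n_train, seed=0):
    """Shuffle the chosen columns, train on the first n_train rows, test on the rest."""
    rng = random.Random(seed)
    Xs = shuffle_columns(X, cols, rng) if cols else X
    return knn_mae(Xs[:n_train], y[:n_train], Xs[n_train:], y[n_train:])


# Synthetic data: two features, but only feature 0 determines the response.
rng = random.Random(7)
X = [[rng.random(), rng.random()] for _ in range(200)]
y = [row[0] for row in X]

baseline = straw_model_mae(X, y, cols=[], n_train=150)
shuffled_informative = straw_model_mae(X, y, cols=[0], n_train=150)
print(f"baseline MAE = {baseline:.3f}, "
      f"MAE with feature 0 shuffled = {shuffled_informative:.3f}")
```

Shuffling breaks the link between a feature and the response while preserving the feature's distribution, so a large rise in error indicates that the model was relying on real structure in that feature.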

Feature importance illuminates cell type-preferential LNP formulation design rules.

The SHAP summary plot for the B16F10 cell type maps the feature importance, illustrates structure-function relationships unveiled by the LGBM model, and identifies relative feature values associated with enhanced or diminished transfection efficiency (Figure 4a, Supplementary Figure 11). Composition parameters were strongly predictive of transfection efficiency. Helper lipid molar percentage of charged lipids, HL/(IL+HL)%, between 9–50%, PEGylated lipid molar percentage of uncharged lipids, PEG/(Chol+PEG)%, in the range of 0.2–1%, and N/P ratios approaching 12 resulted in strongly positive contributions to transfection efficiency in B16F10 cells. We did not observe equally clear trends for the total molar percentage of charged lipids, (IL+HL)%, suggesting complex interactions between lipid components. One such interaction observed was that a lower total molar percentage of charged lipids (around 20–40%) and a lower PEGylated lipid percentage (near 0.2–1%) tend to enhance each other's positive influence on transfection efficiency (Supplementary Figure 12). Of the helper lipid features, cLogP near 12.7–12.9, positive charges (cationic or zwitterionic lipid head groups), greater saturation of helper lipid tails (0 double bonds), and lower tail length (28 total carbons) improved transfection in B16F10 cells (Figure 4a, Table 3).
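To make the quantity behind a SHAP plot concrete, the sketch below computes exact Shapley values by brute force for a toy three-feature model. It is not the algorithm used by the SHAP library for tree models, and it simplifies the value function by replacing "absent" features with a single baseline point instead of averaging over a background dataset; the model and inputs are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict(x) relative to predict(baseline)."""
    n = len(x)

    def value(subset):
        # Features in `subset` take their real value; the rest take the baseline.
        mixed = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size: |S|! (n - |S| - 1)! / n!
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_i = value(set(subset) | {i})
                without_i = value(set(subset))
                total += weight * (with_i - without_i)
        phi.append(total)
    return phi


def toy_model(features):
    """Hypothetical model: one additive term plus one pairwise interaction."""
    a, b, c = features
    return 2.0 * a + b * c


phi = shapley_values(toy_model, x=[1.0, 2.0, 3.0], baseline=[0.0, 0.0, 0.0])
print([round(p, 3) for p in phi])
```

The additive feature receives its full contribution (2), while the interaction term (2 × 3 = 6) is split equally between the two interacting features (3 each). The three values sum to the difference between the prediction and the baseline prediction (8), which is the additivity property that allows SHAP values to be read as per-feature contributions to a single prediction.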

Figure 4.

SHAP elucidates LNP structure-function relationships for B16F10 transfection. (a) SHapley Additive exPlanations (SHAP) summary plot describing the impact of feature values on final model output for the refined B16F10 LGBM model. Features are found on the y-axis, grouped by feature type (composition or helper lipid). Each point represents a single LNP formulation, and the color represents the relative feature value within the formulation. Points located to the right of the x-axis (x>0) indicate positive impact on predicted transfection, while points located to the left (x<0) indicate negative impact. (b–d) LNP formulations represented by individual dots embedded on 2 dimensions by t-SNE dimensional reduction of all SHAP values. (b) Formulations are colored by feature value used. (c) Formulations are colored by feature SHAP value. (d) Formulations colored by ground-truth normalized transfection efficiency in B16F10 cells. Black arrows indicate potential regions for improved LNP transfection efficiency.

Table 3.

Machine learning elucidated design rules for optimal cell type-preferential transfection

| Feature | B16F10 | HepG2 | PC3 | HEK293 | N2a | ARPE-19 |
|---|---|---|---|---|---|---|
| HL/(IL+HL) mol% | 9.09 | 50 | 50 | 9.09 | 50 | 50 |
| (IL+HL) mol% | 40 | 80 | 60 | 60 | 60 | 20 |
| PEG/(Chol+PEG) mol% | 0.99 | 0.99 | 0.99 | 0.99 | 0.20 | 0.99 |
| N/P Ratio | 12 | 12 | 12 | 12 | 4 | 8 |
| Positively Charged Centers | 1 | 1 | 1 | 1 | 1 | 1 |
| cLogP | 12.515 | 12.73 | 12.73 | 9.784 | N/A | 12.515 |
| Hydrogen Bond Donors | N/A | 3 | N/A | N/A | N/A | 2 |
| Hydrogen Bond Acceptors | N/A | 9 | 9 | N/A | 9 | N/A |
| Carbons in Tails | 28 | 28 | N/A | N/A | N/A | N/A |
| Double Bonds in Tails | 0 | N/A | N/A | 2 | 2 | N/A |

IL, ionizable lipid; HL, helper lipid; (IL+HL), total molar percentage of charged lipids; HL/(IL+HL), helper lipid molar percentage of all charged lipids; Chol, cholesterol; PEG, DMG-PEG2000 lipid; PEG/(Chol+PEG), PEGylated lipid molar percentage of all uncharged lipids; N/P ratio, nitrogen/phosphate ratio; cLogP, calculated LogP; mol%, molar percentage.
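The three composition ratios defined in the footnote can be converted back into the molar percentage of each lipid. The sketch below is illustrative only: it assumes the four lipid classes (IL, HL, Chol, PEG) sum to 100 mol%, and the function name `lnp_mol_percentages` is hypothetical rather than taken from the authors' code.

```python
def lnp_mol_percentages(hl_of_charged, charged_total, peg_of_uncharged):
    """Convert the three composition ratios into mol% of each lipid.

    All arguments and return values are percentages (0-100).
    Assumption: IL + HL + Chol + PEG = 100 mol% of total lipid.
    """
    helper = charged_total * hl_of_charged / 100.0        # HL share of (IL+HL)
    ionizable = charged_total - helper                     # remainder of charged lipids
    uncharged_total = 100.0 - charged_total                # Chol + PEG
    peg = uncharged_total * peg_of_uncharged / 100.0       # PEG share of (Chol+PEG)
    cholesterol = uncharged_total - peg
    return {"IL": ionizable, "HL": helper, "Chol": cholesterol, "PEG": peg}


# B16F10 column of Table 3: HL/(IL+HL) = 9.09, (IL+HL) = 40, PEG/(Chol+PEG) = 0.99
b16f10 = lnp_mol_percentages(9.09, 40.0, 0.99)
for lipid, pct in b16f10.items():
    print(f"{lipid}: {pct:.2f} mol%")
```

Under this assumption, the B16F10 column corresponds to roughly 36.4 mol% ionizable lipid, 3.6 mol% helper lipid, 59.4 mol% cholesterol, and 0.6 mol% PEGylated lipid.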

Visualizing the feature values and SHAP values of the top three features in two-dimensional space by t-distributed stochastic neighbor embedding (t-SNE) dimensional reduction provides deeper insight into formulation design rules (Figure 4b,c). Two separate local, or potentially global, transfection maxima can be identified within the formulation parameter space screened for B16F10 cells (Figure 4d). By identifying formulations residing in regions of high transfection efficiency, we discovered LNP design rules from the refined LGBM model. Helper lipid cLogP (near 12.7–12.9), PEGylated lipid molar percentages of uncharged lipids approaching 1%, and helper lipid molar percentages of charged lipids between 9 and 50% were all clearly localized within regions of high transfection (Figure 4b). Similarly, high SHAP values colocalized with regions of high transfection and display the additive effect of optimizing each formulation parameter to improve transfection (Figure 4c). Other composition and helper lipid feature values also colocalize with regions of high transfection efficiency, albeit less strongly, further emphasizing the complexity of LNP formulation design (Supplementary Figures 13 and 14). The open regions in the 2D plots (highlighted with arrows in Figure 4d), particularly where high experimental transfection efficiencies and high SHAP values are localized, identify feature spaces where improved LNP formulations may be discovered. In future LNP libraries for B16F10 transfection, we will prioritize designing formulations with helper lipid cLogP near or higher than 12.7–12.9, helper lipid molar percentages of charged lipids between 9 and 50%, and PEGylated lipid molar percentages of uncharged lipids of 0.2–1% to augment transfection performance.

SHAP feature importance analysis was repeated for all six cell types to identify LNP formulation design rules for cell type-preferential transfection (Supplementary Figure 15). Each cell type requires a distinct set of formulation parameters to optimize transfection (Table 3, Figure 5, Supplementary Figure 16). We attribute this result to the unique uptake and intracellular trafficking characteristics of each cell type. SHAP lends credence to our hypothesis that cell type-preferential transfection in vitro can be accomplished by modulating the formulation parameters. Notably, the cell types nonetheless share several individual feature values. Most prominently, helper lipids containing cationic or zwitterionic head groups (1 positively charged center) and low PEGylated lipid content (0.2–1% PEGylated lipid of uncharged lipids) were associated with optimal transfection in all cell types. Our analysis also shows that it is essential to optimize the total charged lipid percentage, helper lipid molar percentage of charged lipids, and N/P ratio, highlighting that even subtle alterations to formulation recipes can have far-reaching effects on transfection outcomes. SHAP analysis is effective in uncovering LNP design parameters preferred by each cell type and provides a promising approach for ML-guided design of bespoke LNPs tailored to the unique requirements of each cell type.

Figure 5.

ML-suggested design features of LNPs for cell type-specific transfection. Input features (y-axis) grouped by feature type (composition, helper lipid, and physicochemical) and suggested LNP parameter values (color) for optimal cell type (x-axis) transfection based on SHAP values from optimized and feature-refined models for each cell type. Feature importance is indicated by dot size. Optimal relative feature values are shown by color. Exact optimal feature values can be found in Table 3. Features that are excluded from the refined input features for a given cell type are labeled “N/A”. Differences in optimal parameter values and feature importances between cell types further highlight that LNP formulation parameters must be optimized for cell type-preferential transfection.

Through careful quantification and feature importance analysis, this study has formalized several LNP composition design heuristics that were hitherto considered “common knowledge” but not formally tested. Compositional features that define effective viable LNP formulations, such as moderate to high helper lipid molar percentage of charged lipids, low PEGylated lipid content, and cationic/zwitterionic helper lipid charge, were generally conserved within the optimal LNP designs for our tested cell types and are clearly reflected in the most recent COVID-19 mRNA vaccines (Spikevax® (Moderna) and Comirnaty®). Importantly, our pipeline significantly improves the current understanding of these LNP features by developing a quantitative approach to characterize the contributions of compositional and helper lipid chemical features.

Moreover, we have identified the most critical features and associated optimal feature values that impact LNP cell type-preferential tropism (Figure 5, Table 3). For example, while each individual composition feature falls within “common design” heuristics, no complete set of optimal composition features is shared between all cell types. Furthermore, the importance of each feature varies dramatically between cell types (Figure 5, Supplementary Figure 11). In terms of lipid composition, we observed that the PEGylated lipid percentage of uncharged lipids is more influential in ARPE-19, B16F10, HEK293, and N2a cells, while the helper lipid percentage of charged lipids is more influential in HepG2 and PC3 cells. This analysis also identified a potentially critical helper lipid feature that impacts cell type tropism, namely helper lipid hydrophobicity (cLogP). In our dataset, SHAP analysis revealed that cLogP plays a critical role in B16F10 and PC3 cells, a moderate role in HEK293, HepG2, and N2a cells, and is a bystander variable in ARPE-19 cells. Overall, we have confirmed many common design heuristics, including effective composition design spaces and cationic/zwitterionic helper lipid charge, while establishing a formal quantitative approach to characterize the contributions of specific design features and identifying key design features and optimal values governing cell type-preferential LNP transfection.

Given the overall importance of chemical features on LNP performance, we were interested in whether our models were truly able to capture and utilize the encoded chemical features of our helper lipids. We first introduced a negative control using one-hot encoding (OHE) of helper lipid class rather than chemical encoding and repeated our model selection and optimization protocol. The model MAE (tested on 15% of the total dataset, randomized but stratified to contain the relative helper lipid proportions found in the total dataset) after model selection shows that chemically encoded models and OHE models perform similarly across all six cell types (Supplementary Figure 17). We next conducted leave-one-lipid-out (HL-1) studies on the OHE models to assess their ability to predict novel lipid chemistries. For B16F10 models, we found that when certain helper lipids (14PA, DOTAP) were left out, the chemically encoded models performed significantly better than OHE models (Supplementary Figure 18). While this was not the case for all cell types and helper lipids, it appears that chemically encoded models may actually incorporate information about lipid properties (such as hydrophobicity) to predict transfection performance. Importantly, we expect chemically encoded models to improve their prediction accuracy for novel lipid candidates once we diversify and expand our helper lipid library.

On the other hand, with OHE models the dimensionality of feature vectors will increase rapidly as we expand the helper lipid library, requiring special techniques to mitigate sparsity and high dimensionality. Finally, we developed a blended model trained on data where both OHE and chemical properties were encoded. Afterwards, we conducted feature ablation and SHAP feature importance studies, which revealed that blended models placed lower weights on OHE parameters than helper lipid chemical parameters. Furthermore, cLogP and number of positively charged centers remained the most critical chemical features driving transfection performance in most cell types (Supplementary Figure 19). Although chemically encoded models do not outperform OHE models, our results suggest that chemically encoded models still effectively identify and utilize chemical features. We will continue to test chemically encoded models on additional lipid candidates and test whether they provide a clear advantage over OHE while rationally exploring the vast chemical space available for designing helper lipids.

Study Limitations.

In this proof-of-concept study, we utilized a high-throughput readout of transfection efficiency (i.e., luciferase assays) to characterize our LNP pDNA delivery performance. Although luciferase expression is convenient for screening large LNP libraries (1080 formulations), this assay has certain limitations. Unlike flow cytometry, which quantifies transgene expression at the single-cell level, luciferase assays provide bulk transgene readouts. Nevertheless, flow cytometry sample preparation is laborious and time-consuming, leading us to employ approximate transfection characterization from luciferase measurements, a convenient and rapid plate reader-based assay. Further, we did not measure cellular uptake and cytotoxicity in this study, but we are developing high-throughput tools to collect these measurements. This information could provide additional insight into structure-function relationships of more effective LNP formulations. Nonetheless, in future work, toxicity, uptake, and flow cytometric data could be easily integrated into the current ML pipeline.

Our LNP library was designed based on a previously reported grid search of the parameter space19,20. However, this approach often leads to under-sampling of high-performing regions. This suggests that a more adaptive library design, such as iterative screening using computation-based optimization methods, could more effectively map the parameter space, leading to a better understanding of the structure-function relationships. The same approach can be used to iteratively optimize LNP formulations. Additionally, our ML pipeline utilizes chemical properties rather than direct chemical structures to parameterize lipid structures. While our pipeline yields interpretable insights, it can only list desirable lipid properties and cannot generate structural predictions for novel lipid candidates, leaving this task to the human experimenter. At present, we cannot confidently deploy our model for predictive tasks in novel lipid formulations for two reasons: our lipid parametrization approach would benefit from incorporating more complex descriptors such as fingerprint descriptors, and our dataset must incorporate a greater variety and number of helper lipid candidates. Further work is needed to evaluate model performance in formulations falling outside the chemical domains tested in our work. As such, the model performance metrics reported in this paper are applicable only to lipid formulations falling within the bounds of Supplementary Table 2. Nevertheless, future work with expanded lipid libraries can employ methods such as chemical fingerprinting43 in addition to chemical properties to achieve better functionality. Cell type-specific training data and their associated artifacts are handled through self-normalization of transfection efficiency and creation of cell-specific models before our comparative analysis. As such, multiple cell type datasets cannot be effectively combined into a single integrated model. Our pipeline can partially overcome this hurdle by providing effective comparison of LNP design rules discovered by cell type-specific models. To better address this limitation, characterizing governing properties of target cells, such as cell surface features, GAG expression patterns, membrane composition, and endocytic preference, could enable the development of unified models.

CONCLUSION

LNP screening has primarily focused on discovering and validating new ionizable lipid chemistries while marginalizing complex interactions between lipid components (helper lipids, ionizable lipids, PEGylated lipids, and sterols). We developed a machine learning pipeline to exploit large well-curated datasets from HTS and elucidate important structure-function relationships of different LNP compositions. We trained, optimized, and validated machine learning models across multiple cell types. After model selection and hyperparameter optimization, we identified LNP design features and formulation parameters that exerted the greatest influence over transfection efficiency. SHAP analysis elucidated the physicochemical drivers of high-performing LNP formulations for each cell type. Importantly, our comparative analysis demonstrates that we must delicately modulate helper lipid chemical parameters and formulation composition to achieve cell type-preferential transfection.

While this study primarily focuses on structure-function relationships driving pDNA LNP delivery performance, particularly helper lipid chemical identity and LNP formulation parameters, our approach is generalizable to a larger chemical design space. Our pipeline is now well positioned to investigate expanded parameter spaces opened up by new lipid chemistries, the addition of more lipid components (such as SORT lipids), and new payload types including mRNA and other RNA therapeutics44. The inclusion of additional input features characterizing the LNP formulation, such as encapsulation efficiency, payload distribution, or protein corona composition, would potentially improve the model’s understanding of the LNP design rules. Future work will include the addition of more structure- and property-based descriptors of lipids to develop a cheminformatic approach for novel lipid synthesis. Moreover, future models will include parametrizations of the unique biological signatures of target cell types, which will serve as additional input features for integrated models. This will allow us to analyze multiple cell types together within a single unified model instead of developing separate models for each cell type. Unified models will identify the governing characteristics of the target cell influencing the efficacy of LNP gene delivery (e.g., GAG expression, endocytic preferences, etc.). Finally, while the current study has focused on in vitro LNP transfection, our pipeline can potentially be improved by incorporating high-throughput in vivo screening techniques such as LNP barcoding45 and cluster-mode screening19, which could enable researchers to uncover the mechanisms driving in vivo LNP structure-function relationships.

METHODS

Preparation and characterization of pDNA LNPs.

The LNPs were formulated in 96-well U-bottom plates according to a previously described method19, using DLin-MC3-DMA (MedKoo Biosciences), cholesterol (Sigma-Aldrich), DMG-PEG2000 (NOF America Corporation), and one of the helper lipids DSPC, DOPE, DOTAP, DDAB, 18PG, or 14PA (all from Avanti Polar Lipids). The lipid phase was prepared by dissolving lipid components in ethanol at a predetermined concentration (based on the formulation composition), and the firefly luciferase pDNA (Aldevron) was diluted in 25 mM magnesium acetate buffer (pH 4.0). LNPs were formulated by combining the aqueous and ethanol phases at a 3:1 ratio with pipette mixing, without further purification. LNPs were prepared fresh for each experiment. The size (z-average) and polydispersity index (PDI) of the LNP library were measured via dynamic light scattering, and the ζ-potential was measured on the same instrument, a Nano ZS90 Zetasizer (Malvern Panalytical). Aliquots of 100 μL of LNPs (25 μg pDNA/mL) were used for measuring size and PDI. For ζ-potential, 100 μL of LNPs were diluted into 900 μL of PBS prior to measurement in a capillary cuvette. Measurements were performed in triplicate for each test sample.

In Vitro Transfection and Luciferase Assay.

For monolayer culture studies, HepG2 (ATCC HB-8065), HEK293T (ATCC CRL-3216), B16F10 (ATCC CRL-6475), PC3 (ATCC CRL-1435), Neuro2a (ATCC CCL-131), and ARPE-19 (ATCC CRL-2302) cells (American Type Culture Collection, USA) were seeded into 96-well plates at a density of 10,000 cells per well one day prior to transfection. After formulation, LNPs were pipetted into culture medium at a final particle concentration of approximately 1 μg pDNA/mL (i.e., 4 μL of a particle suspension at 25 μg pDNA/mL was pipetted into 100 μL of culture medium in each well). After 48 h of incubation, cells were lysed and luciferase expression was analyzed using a luminometer following a standard protocol.
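The dosing figures quoted above can be checked with a few lines of arithmetic. This is only a worked calculation from the stated volumes and stock concentration; the variable names are illustrative.

```python
# Values stated in the transfection protocol.
stock_ug_per_ml = 25.0   # LNP suspension concentration (ug pDNA/mL)
added_ul = 4.0           # volume of LNP suspension added per well (uL)
medium_ul = 100.0        # culture medium already in the well (uL)

# ug/mL multiplied by uL gives ng, because 1 ug/mL equals 1 ng/uL.
pdna_ng_per_well = stock_ug_per_ml * added_ul

# ng/uL is numerically equal to ug/mL.
final_ug_per_ml = pdna_ng_per_well / (added_ul + medium_ul)

print(f"pDNA per well: {pdna_ng_per_well:.0f} ng")
print(f"final concentration: {final_ug_per_ml:.2f} ug pDNA/mL")
```

Each well therefore receives 100 ng of pDNA, and accounting for the added 4 μL gives a final concentration of about 0.96 μg pDNA/mL, consistent with the nominal 1 μg pDNA/mL.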

Helper Lipid Parameterization.

The chemical encoding features used to parameterize the helper lipid structure were selected based on historical knowledge of important lipid structures. More complex parameterization methods such as extended-connectivity fingerprints28 were not used, both to preserve interpretability of the chemical features and because of the relatively small diversity of helper lipids tested (six in total). Helper lipid chemical features included the numbers of positively and negatively charged centers, the numbers of hydrogen bond donors and acceptors, and the total carbons and double bonds in the lipid tails. The log of the partition coefficient between octanol and water (cLogP) was calculated using ChemDraw (v22.2.0, PerkinElmer).

Helper lipid one hot encoding (OHE) feature data was created by defining six unique features for each helper lipid and populated by denoting helper lipid presence (labeled with 1) or helper lipid absence (labeled with 0) in the LNP formulation.
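As a minimal sketch of the one-hot encoding described above, the helper below builds six binary features from a helper lipid name. The lipid names come from the Methods; the `HL_` column prefix and the function name are hypothetical choices made for illustration.

```python
# The six helper lipids screened in this study.
HELPER_LIPIDS = ["DSPC", "DOPE", "DOTAP", "DDAB", "18PG", "14PA"]


def one_hot_helper_lipid(name):
    """Return six binary features: 1 for the lipid present, 0 for the others."""
    if name not in HELPER_LIPIDS:
        raise ValueError(f"unknown helper lipid: {name}")
    return {f"HL_{lipid}": int(lipid == name) for lipid in HELPER_LIPIDS}


encoded = one_hot_helper_lipid("DOTAP")
print(encoded)
```

Because every new helper lipid adds one more column, this encoding grows with the size of the library, whereas the chemical encoding keeps a fixed number of descriptors.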

Data Preprocessing.

To facilitate accurate analysis and interpretation of the experimental data, the luciferase assay results, which serve as the target output variable, were natural logarithm (ln) transformed to rescale the experimental transfection efficiency. Ln-transformed RLU values were then floored to the average value of blank wells (around e^1.5 RLU) to minimize noise at low and near-zero transfection measurements. RLU measurements were then normalized by linear scaling from 0 to 1 as a fraction of the observed maximum transfection value for the corresponding cell type dataset using Eq. 1:

$$X_{\mathrm{norm}} = \frac{X - X_{\min}}{X_{\max} - X_{\min}} \tag{1}$$

The data normalization method was applied to each cell type dataset individually to minimize the impact of cell type “transfectability” in downstream analysis.
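The preprocessing steps above (ln transform, flooring to the blank level, min-max scaling per cell type) can be sketched as follows. This is an illustrative, dependency-free version rather than the authors' code; the function name and the handling of non-positive readings are assumptions.

```python
import math


def preprocess_rlu(rlu_values, floor_ln=1.5):
    """ln-transform RLU readings, floor them, and min-max scale to [0, 1].

    floor_ln is the blank-well level on the ln scale (about 1.5 in the text).
    Assumption: non-positive readings are treated as blank-level signal.
    """
    ln_values = []
    for rlu in rlu_values:
        ln_rlu = math.log(rlu) if rlu > 0 else floor_ln
        ln_values.append(max(ln_rlu, floor_ln))  # floor low readings
    x_min, x_max = min(ln_values), max(ln_values)
    if x_max == x_min:
        return [0.0 for _ in ln_values]  # no dynamic range to scale
    return [(x - x_min) / (x_max - x_min) for x in ln_values]


# Hypothetical readings for one cell type dataset.
example_rlu = [1.0, math.exp(1.5), math.exp(6.5), math.exp(11.5)]
normalized = preprocess_rlu(example_rlu)
print([round(v, 3) for v in normalized])
```

Applying the function separately to each cell type dataset mirrors the per-cell-type normalization described in the text.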

Model Development and Evaluation.

Chemically encoded, one-hot encoded (negative control), and blended (negative control) models were all developed and evaluated using the same process described below; they differ only in their training data. Chemically encoded models were trained on datasets containing chemically encoded helper lipid features. One-hot encoded models were trained on datasets containing OHE helper lipid features. Blended models were trained on datasets containing both chemically encoded and OHE helper lipid features.

The scikit-learn Python ML package (v1.2.2) was used to conduct model training and prediction. The panel of models for evaluation included multiple linear regression (MLR), lasso regression (lasso), partial least squares regression (PLS), k-nearest neighbors (kNN), multi-layer perceptron (MLP), decision trees (DT), random forest (RF), XGBoost (XGB), and LightGBM (LGBM); these were implemented from the publicly available scikit-learn (v1.2.1), xgboost (v1.7.1), and lightgbm (v3.3.5) Python packages. A nested cross-validation procedure for training and evaluating predictive models was adapted from the Aspuru-Guzik group’s GitHub repository46. Edits to hyperparameter grids, data splitting functions, and result saving were made for effective integration into our pipeline. This nested approach supports robust model evaluation, hyperparameter tuning, and generalization, leading to reliable performance estimates while mitigating overfitting risks. First, an 85:15 split of all data creates a training set and a hold-out validation set completely insulated from model development. The 15% hold-out set was stratified to contain similar helper lipid class proportions as the total dataset. The 85% split enters the nested cross-validation procedure, where an outer loop executed over 5-fold cross-validation splits randomly divides the data (85% of all data) into training (68% of all data) and test sets (17% of all data). Within each outer loop iteration, an inner loop conducts a 4-fold split of the training data and runs 100 iterations of hyperparameter optimization for each split to minimize validation set mean absolute error (MAE) using a randomized grid search strategy. MAE was used to evaluate model performance because it quantifies the model’s overall prediction error. By minimizing the MAE, we selected models that achieve a closer alignment between predictions and ground-truth values on average.
The equation for the Absolute error (AE) and MAE is defined as follows:

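$$\mathrm{AE} = \left| y_i - \hat{y}_i \right| \tag{2}$$

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right| \tag{3}$$

where $n$ is the number of samples, $y_i$ is the predicted output by the model, and $\hat{y}_i$ is the ground-truth output.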
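Equations 2 and 3 translate directly into code. The following is a plain-Python sketch with hypothetical numbers; in practice an equivalent library routine such as scikit-learn's `mean_absolute_error` would give the same result.

```python
def absolute_errors(predicted, ground_truth):
    """Per-sample absolute error (Eq. 2)."""
    if len(predicted) != len(ground_truth):
        raise ValueError("inputs must have the same length")
    return [abs(p - t) for p, t in zip(predicted, ground_truth)]


def mean_absolute_error(predicted, ground_truth):
    """Average of the per-sample absolute errors (Eq. 3)."""
    errors = absolute_errors(predicted, ground_truth)
    return sum(errors) / len(errors)


# Hypothetical normalized transfection efficiencies.
predicted = [0.25, 0.40, 0.90]
ground_truth = [0.20, 0.50, 0.90]
mae = mean_absolute_error(predicted, ground_truth)
print(f"MAE = {mae:.3f}")
```

Because the target is normalized to the 0 to 1 range, an MAE of 0.05 corresponds to an average error of 5% of the observed transfection range.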

The model hyperparameters with the lowest validation MAE from the inner loop splits were then evaluated on the outer loop hold-out test set, generating test set performance metrics including MAE, Spearman’s rank correlation, and Pearson’s correlation coefficient. This process was repeated for each of the five outer loop iterations. To select the final model hyperparameters, the absolute difference between the inner loop validation MAE and the outer loop hold-out MAE was calculated as the “score difference” for each outer loop iteration. The model with the minimum “score difference” was selected as the final model to minimize the impact of overfitting. Finally, the chosen model was evaluated on the hold-out validation set (15% of all data) to assess performance on a completely unseen dataset.
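The split sizes and the "score difference" selection rule can be illustrated with a short sketch. The MAE values below are hypothetical and the function name is illustrative; only the split proportions and the selection criterion come from the text.

```python
from fractions import Fraction

# Fractions of the full dataset at each level of the nested procedure.
holdout = Fraction(15, 100)               # final hold-out validation set
development = 1 - holdout                 # 85% enters nested cross-validation
outer_test = development / 5              # one of 5 outer folds
outer_train = development - outer_test    # remaining 4 outer folds
inner_validation = outer_train / 4        # one of 4 inner folds
inner_train = outer_train - inner_validation


def select_by_score_difference(results):
    """Pick the outer-loop result whose inner validation MAE and outer
    test MAE are closest, i.e. the smallest 'score difference'."""
    return min(results, key=lambda r: abs(r["inner_val_mae"] - r["outer_test_mae"]))


# Hypothetical results from three outer-loop iterations.
outer_results = [
    {"fold": 0, "inner_val_mae": 0.050, "outer_test_mae": 0.058},
    {"fold": 1, "inner_val_mae": 0.052, "outer_test_mae": 0.053},
    {"fold": 2, "inner_val_mae": 0.047, "outer_test_mae": 0.061},
]
best = select_by_score_difference(outer_results)
print(float(outer_train), float(outer_test), float(inner_train), best["fold"])
```

The arithmetic gives 0.85 × 4/5 = 0.68 of all data for outer training and 0.85 × 1/5 = 0.17 for outer testing. Selecting on the gap between validation and test error, rather than on the lowest error alone, favors hyperparameters whose performance transfers between data splits.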

The model learning curve was produced by evaluating MAE as a function of the number of training datapoints provided. Different sized subsets of the training data were randomly selected, and the resulting train and validation error (MAE) was averaged over 5 repeated 5-fold cross validation test-train data splits. The training size ranged from 1 to 864 formulations for the 1080-formulation datasets, and 1 to 576 for the 720-formulation datasets.

Leave-one-lipid-out (HL-1) model evaluation was conducted by iteratively withholding all data for one helper lipid from the training dataset, training models on this HL-1 dataset, and testing the model on the withheld helper lipid data. This provides a snapshot of model performance on novel lipid structures.
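The leave-one-lipid-out splitting can be sketched as a generator over helper lipids. The row layout (a list of dictionaries with a `helper_lipid` key) and the toy values are assumptions made for illustration.

```python
def leave_one_lipid_out(rows, lipid_key="helper_lipid"):
    """Yield (held_out_lipid, train_rows, test_rows) for each helper lipid."""
    lipids = sorted({row[lipid_key] for row in rows})
    for held_out in lipids:
        train = [row for row in rows if row[lipid_key] != held_out]
        test = [row for row in rows if row[lipid_key] == held_out]
        yield held_out, train, test


# Hypothetical miniature dataset: two formulations per helper lipid.
toy_rows = [
    {"helper_lipid": "DOPE", "np_ratio": 4, "transfection": 0.31},
    {"helper_lipid": "DOPE", "np_ratio": 12, "transfection": 0.55},
    {"helper_lipid": "DOTAP", "np_ratio": 4, "transfection": 0.48},
    {"helper_lipid": "DOTAP", "np_ratio": 12, "transfection": 0.72},
    {"helper_lipid": "DSPC", "np_ratio": 4, "transfection": 0.12},
    {"helper_lipid": "DSPC", "np_ratio": 12, "transfection": 0.20},
]
splits = list(leave_one_lipid_out(toy_rows))
for held_out, train, test in splits:
    print(held_out, len(train), len(test))
```

With one-hot encoding, the column for the withheld lipid is all zeros during training, so the model has no information about it; chemically encoded features can still place the withheld lipid relative to the lipids seen in training.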

Straw models were evaluated by averaging the MAE over 20 trials of randomized 5-fold test-train splits on datasets in which groups of input features were randomly shuffled. Input features were grouped into polar intermolecular interaction features (hydrogen bond donors/acceptors and positively charged centers), apolar features associated with lipid hydrophobicity/rigidity (cLogP and double bonds in tails), and all helper lipid features (both polar and apolar). This provides insight into model utilization of chemical features.
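A straw dataset of this kind can be produced by permuting a group of feature columns across formulations while leaving the other columns and the target untouched. The sketch below assumes the columns within a group are permuted together (the text does not specify whether they were shuffled jointly or independently); the feature names and values are hypothetical.

```python
import random


def shuffle_feature_group(rows, group, seed=0):
    """Return a copy of rows in which the columns named in `group` are
    permuted across rows, breaking their link to the target.

    Assumption: all columns in the group share one permutation.
    """
    rng = random.Random(seed)
    order = list(range(len(rows)))
    rng.shuffle(order)
    shuffled = []
    for i, row in enumerate(rows):
        new_row = dict(row)          # copy so the input is not modified
        donor = rows[order[i]]       # row that donates the group values
        for name in group:
            new_row[name] = donor[name]
        shuffled.append(new_row)
    return shuffled


# Hypothetical rows with one apolar group and one composition feature.
toy = [
    {"cLogP": 9.8, "tail_double_bonds": 2, "np_ratio": 4, "y": 0.2},
    {"cLogP": 12.5, "tail_double_bonds": 0, "np_ratio": 8, "y": 0.6},
    {"cLogP": 12.7, "tail_double_bonds": 0, "np_ratio": 12, "y": 0.9},
    {"cLogP": 11.1, "tail_double_bonds": 1, "np_ratio": 12, "y": 0.4},
]
straw = shuffle_feature_group(toy, ["cLogP", "tail_double_bonds"], seed=1)
```

A model trained on the shuffled data and evaluated with the same cross-validation scheme should lose accuracy if it genuinely relies on the shuffled features.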

Model Refinement and Interpretation.

The hierarchical clustering package from SciPy (v1.10.0) was used to cluster all initial input features based on the farthest neighbor clustering algorithm. Feature reduction was performed by sequentially removing a single feature based on its Ward linkage distance and then retraining models on the training data with the reduced input parameters for each cell type. The code used was adapted from the Aspuru-Guzik group’s GitHub repository46, and the process was adjusted such that features whose removal degraded performance were kept in the feature set. The performance of each model following removal of a feature was evaluated using the average MAE across 5 trials of shuffled 5-fold cross-validation data splits to validate the impact of feature removal on model performance. If feature removal negatively impacted model performance (increased MAE), the feature was retained, and feature refinement continued until the removal of every feature had been tested. The optimal refined feature set was defined as the minimum set of features needed to maintain or improve the initial MAE (the MAE of the model trained on all input features).
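The retain-or-remove logic of the refinement loop can be sketched independently of the clustering step. In this illustration the removal order is supplied by the caller (in the study it is derived from hierarchical clustering), `evaluate_mae` is a hypothetical stand-in for repeated cross-validation, and each trial is compared against the initial all-feature MAE, which is one reading of the procedure.

```python
def refine_features(features, removal_order, evaluate_mae):
    """Greedy backward elimination: drop a feature only if the MAE does
    not rise above the MAE of the model trained on all features."""
    kept = list(features)
    initial_mae = evaluate_mae(kept)
    for candidate in removal_order:
        if candidate not in kept or len(kept) == 1:
            continue
        trial = [f for f in kept if f != candidate]
        if evaluate_mae(trial) <= initial_mae:
            kept = trial             # removal is harmless, so accept it
        # otherwise the feature is retained and the loop moves on
    return kept


# Hypothetical scorer: each missing informative feature adds 0.02 to the MAE.
INFORMATIVE = {"cLogP", "N/P ratio"}


def toy_evaluate_mae(feature_subset):
    missing = len(INFORMATIVE - set(feature_subset))
    return 0.05 + 0.02 * missing


all_features = ["cLogP", "N/P ratio", "noise_a", "noise_b"]
order = ["noise_a", "cLogP", "noise_b", "N/P ratio"]
refined = refine_features(all_features, order, toy_evaluate_mae)
print(refined)
```

With the toy scorer, the two uninformative features are removed and the two informative ones are retained.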

Feature importance analysis was conducted using the publicly available SHapley Additive exPlanations (SHAP) package (v0.42.1). SHAP analysis plots visualized trends between input features and model outputs. LNP design rule analysis was conducted using the calculated SHAP values, which measure the impact of feature values on the final model output, to identify feature value ranges with high average SHAP values (high average positive impact on model output). These feature ranges provide potential guidelines for the input feature values needed to maximize model output (transfection efficiency).
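The design-rule step, finding the feature value with the highest average SHAP value, can be sketched as a simple group-and-average operation. The SHAP values below are made-up numbers for illustration; in practice they would come from the SHAP package applied to the trained model.

```python
from collections import defaultdict


def best_feature_value(feature_values, shap_values):
    """Group SHAP values by feature value and return the value with the
    highest mean SHAP, together with all the per-value means."""
    grouped = defaultdict(list)
    for value, shap_value in zip(feature_values, shap_values):
        grouped[value].append(shap_value)
    means = {value: sum(vals) / len(vals) for value, vals in grouped.items()}
    best = max(means, key=means.get)
    return best, means


# Hypothetical SHAP values for the N/P ratio feature of six formulations.
np_ratios = [4, 4, 8, 8, 12, 12]
np_shap = [-0.03, -0.01, 0.00, 0.02, 0.04, 0.06]
best_np, mean_shap = best_feature_value(np_ratios, np_shap)
print(best_np, mean_shap)
```

For continuous features such as cLogP, the same idea applies after binning the feature values into ranges.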

Statistical Analysis.

Statistical analysis was conducted using SciPy (Python package) or GraphPad Prism 8 for one-way ANOVA followed by Tukey’s multiple comparisons test or Dunnett’s multiple comparisons test, and for multiple t tests using the Holm-Sidak method, where *p < 0.05, **p < 0.01, ***p < 0.001, ****p < 0.0001.

Table 1.

Model selection results for LNP transfection in B16F10 cells

| Model Type | LGBM | XGB | RF | DT | MLP | kNN | MLR | PLS | lasso |
|---|---|---|---|---|---|---|---|---|---|
| Mean Absolute Error (MAE) | 0.049 | 0.054 | 0.055 | 0.057 | 0.087 | 0.098 | 0.100 | 0.100 | 0.101 |
| Spearman’s Rank Correlation | 0.917 | 0.905 | 0.888 | 0.886 | 0.834 | 0.658 | 0.685 | 0.685 | 0.689 |
| Pearson’s Correlation | 0.938 | 0.919 | 0.912 | 0.902 | 0.858 | 0.671 | 0.706 | 0.706 | 0.707 |

LGBM, light gradient-boosting machine; RF, random forest; XGB, XGBoost; DT, decision trees; MLP, multi-layer perceptron; kNN, k-nearest neighbors; MLR, multiple linear regression; PLS, partial least squares regression; lasso, least absolute shrinkage and selection operator regression.

Funding Sources

This study was partially supported by the National Institute of Allergy and Infectious Diseases (NIAID) grant U01AI155313 to H.-Q.M. and the National Institute of Biomedical Imaging and Bioengineering grant R21 EB034464 to R.K.

ABBREVIATIONS

LNP, lipid nanoparticle; pDNA, plasmid DNA; HTS, high-throughput screening; ML, machine learning; IL, ionizable lipid; HL, helper lipid; Chol, cholesterol; PEG, DMG-PEG2000 lipid; (IL+HL), total molar percentage of charged lipids; HL/(IL+HL), helper lipid molar percentage of all charged lipids; PEG/(Chol+PEG), PEGylated lipid molar percentage of all uncharged lipids; N/P ratio, nitrogen/phosphate ratio; cLogP, calculated LogP; LGBM, light gradient-boosting machine; RF, random forest; XGB, XGBoost; DT, decision trees; MLP, multi-layer perceptron; kNN, k-nearest neighbors; MLR, multiple linear regression; PLS, partial least squares regression; lasso, least absolute shrinkage and selection operator regression; AE, absolute error; MAE, mean absolute error; SHAP, SHapley Additive exPlanations.

Footnotes

Supporting Information Available.

Additional data regarding the lipid nanoparticle library composition; additional model selection, optimization, feature analysis, and design rules results for all six cell types (PDF). Relevant code and data will be placed in GitHub and linked to article doi after the publication of this manuscript.

REFERENCES

  • (1).Bulaklak K; Gersbach CA The Once and Future Gene Therapy. Nat Commun 2020, 11 (1), 5820. 10.1038/s41467-020-19505-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (2).Sung Y; Kim S Recent Advances in the Development of Gene Delivery Systems. Biomater Res 2019, 23 (1), 8. 10.1186/s40824-019-0156-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (3).Wang D; Tai PWL; Gao G Adeno-Associated Virus Vector as a Platform for Gene Therapy Delivery. Nat Rev Drug Discov 2019, 18 (5), 358–378. 10.1038/s41573-019-0012-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (4).Ren D; Fisson S; Dalkara D; Ail D Immune Responses to Gene Editing by Viral and Non-Viral Delivery Vectors Used in Retinal Gene Therapy. Pharmaceutics 2022, 14 (9), 1973. 10.3390/pharmaceutics14091973. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (5).Shirley JL; De Jong YP; Terhorst C; Herzog RW Immune Responses to Viral Gene Therapy Vectors. Molecular Therapy 2020, 28 (3), 709–722. 10.1016/j.ymthe.2020.01.001. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (6).Tenchov R; Bird R; Curtze AE; Zhou Q Lipid Nanoparticles─From Liposomes to mRNA Vaccine Delivery, a Landscape of Research Diversity and Advancement. ACS Nano 2021, 15 (11), 16982–17015. 10.1021/acsnano.1c04996. [DOI] [PubMed] [Google Scholar]
  • (7).Jung HN; Lee S-Y; Lee S; Youn H; Im H-J Lipid Nanoparticles for Delivery of RNA Therapeutics: Current Status and the Role of in Vivo Imaging. Theranostics 2022, 12 (17), 7509–7531. 10.7150/thno.77259.
  • (8).Yang L; Gong L; Wang P; Zhao X; Zhao F; Zhang Z; Li Y; Huang W Recent Advances in Lipid Nanoparticles for Delivery of mRNA. Pharmaceutics 2022, 14 (12), 2682. 10.3390/pharmaceutics14122682.
  • (9).Huang Q; Zeng J; Yan J COVID-19 mRNA Vaccines. Journal of Genetics and Genomics 2021, 48 (2), 107–114. 10.1016/j.jgg.2021.02.006.
  • (10).Hald Albertsen C; Kulkarni JA; Witzigmann D; Lind M; Petersson K; Simonsen JB The Role of Lipid Components in Lipid Nanoparticles for Vaccines and Gene Therapy. Advanced Drug Delivery Reviews 2022, 188, 114416. 10.1016/j.addr.2022.114416.
  • (11).Sun D; Lu Z-R Structure and Function of Cationic and Ionizable Lipids for Nucleic Acid Delivery. Pharm Res 2023, 40 (1), 27–46. 10.1007/s11095-022-03460-2.
  • (12).Aldosari BN; Alfagih IM; Almurshedi AS Lipid Nanoparticles as Delivery Systems for RNA-Based Vaccines. Pharmaceutics 2021, 13 (2), 206. 10.3390/pharmaceutics13020206.
  • (13).Patel S; Ashwanikumar N; Robinson E; Xia Y; Mihai C; Griffith JP; Hou S; Esposito AA; Ketova T; Welsher K; Joyal JL; Almarsson Ö; Sahay G Naturally-Occurring Cholesterol Analogues in Lipid Nanoparticles Induce Polymorphic Shape and Enhance Intracellular Delivery of mRNA. Nat Commun 2020, 11 (1), 983. 10.1038/s41467-020-14527-2.
  • (14).Kulkarni JA; Witzigmann D; Leung J; Tam YYC; Cullis PR On the Role of Helper Lipids in Lipid Nanoparticle Formulations of siRNA. Nanoscale 2019, 11 (45), 21733–21739. 10.1039/C9NR09347H.
  • (15).Zong Y; Lin Y; Wei T; Cheng Q Lipid Nanoparticle (LNP) Enables mRNA Delivery for Cancer Therapy. Advanced Materials 2023, 2303261. 10.1002/adma.202303261.
  • (16).Zhao X; Chen J; Qiu M; Li Y; Glass Z; Xu Q Imidazole‐Based Synthetic Lipidoids for In Vivo mRNA Delivery into Primary T Lymphocytes. Angew. Chem. Int. Ed 2020, 59 (45), 20083–20089. 10.1002/anie.202008082.
  • (17).Ni H; Hatit MZC; Zhao K; Loughrey D; Lokugamage MP; Peck HE; Cid AD; Muralidharan A; Kim Y; Santangelo PJ; Dahlman JE Piperazine-Derived Lipid Nanoparticles Deliver mRNA to Immune Cells in Vivo. Nat Commun 2022, 13 (1), 4766. 10.1038/s41467-022-32281-5.
  • (18).Guimaraes PPG; Zhang R; Spektor R; Tan M; Chung A; Billingsley MM; El-Mayta R; Riley RS; Wang L; Wilson JM; Mitchell MJ Ionizable Lipid Nanoparticles Encapsulating Barcoded mRNA for Accelerated in Vivo Delivery Screening. Journal of Controlled Release 2019, 316, 404–417. 10.1016/j.jconrel.2019.10.028.
  • (19).Zhu Y; Shen R; Vuong I; Reynolds RA; Shears MJ; Yao Z-C; Hu Y; Cho WJ; Kong J; Reddy SK; Murphy SC; Mao H-Q Multi-Step Screening of DNA/Lipid Nanoparticles and Co-Delivery with siRNA to Enhance and Prolong Gene Expression. Nat Commun 2022, 13 (1), 4282. 10.1038/s41467-022-31993-y.
  • (20).Zhu Y; Ma J; Shen R; Lin J; Li S; Lu X; Stelzel JL; Kong J; Cheng L; Vuong I; Yao Z-C; Wei C; Korinetz NM; Toh WH; Choy J; Reynolds RA; Shears MJ; Cho WJ; Livingston NK; Howard GP; Hu Y; Tzeng SY; Zack DJ; Green JJ; Zheng L; Doloff JC; Schneck JP; Reddy SK; Murphy SC; Mao H-Q Screening for Lipid Nanoparticles That Modulate the Immune Activity of Helper T Cells towards Enhanced Antitumour Activity. Nat. Biomed. Eng 2023, 1–17. 10.1038/s41551-023-01131-0.
  • (21).Gong D; Ben-Akiva E; Singh A; Yamagata H; Est-Witte S; Shade JK; Trayanova NA; Green JJ Machine Learning Guided Structure Function Predictions Enable in Silico Nanoparticle Screening for Polymeric Gene Delivery. Acta Biomaterialia 2022, 154, 349–358. 10.1016/j.actbio.2022.09.072.
  • (22).Wang W; Feng S; Ye Z; Gao H; Lin J; Ouyang D Prediction of Lipid Nanoparticles for mRNA Vaccines by the Machine Learning Algorithm. Acta Pharmaceutica Sinica B 2022, 12 (6), 2950–2962. 10.1016/j.apsb.2021.11.021.
  • (23).Maharjan R; Hada S; Lee JE; Han H-K; Kim KH; Seo HJ; Foged C; Jeong SH Comparative Study of Lipid Nanoparticle-Based mRNA Vaccine Bioprocess with Machine Learning and Combinatorial Artificial Neural Network-Design of Experiment Approach. International Journal of Pharmaceutics 2023, 640, 123012. 10.1016/j.ijpharm.2023.123012.
  • (24).Gao H; Kan S; Ye Z; Feng Y; Jin L; Zhang X; Deng J; Chan G; Hu Y; Wang Y; Cao D; Ji Y; Liang M; Li H; Ouyang D Development of in Silico Methodology for siRNA Lipid Nanoparticle Formulations. Chemical Engineering Journal 2022, 442, 136310. 10.1016/j.cej.2022.136310.
  • (25).Tamasi MJ; Patel RA; Borca CH; Kosuri S; Mugnier H; Upadhya R; Murthy NS; Webb MA; Gormley AJ Machine Learning on a Robotic Platform for the Design of Polymer–Protein Hybrids. Advanced Materials 2022, 34 (30), 2201809. 10.1002/adma.202201809.
  • (26).Nohara Y; Matsumoto K; Soejima H; Nakashima N Explanation of Machine Learning Models Using Improved Shapley Additive Explanation. In Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics; ACM: Niagara Falls NY USA, 2019; pp 546–546. 10.1145/3307339.3343255.
  • (27).Li S; Hu Y; Li A; Lin J; Hsieh K; Schneiderman Z; Zhang P; Zhu Y; Qiu C; Kokkoli E; Wang T-H; Mao H-Q Payload Distribution and Capacity of mRNA Lipid Nanoparticles. Nat Commun 2022, 13 (1), 5561. 10.1038/s41467-022-33157-4.
  • (28).Rogers D; Hahn M Extended-Connectivity Fingerprints. J. Chem. Inf. Model 2010, 50 (5), 742–754. 10.1021/ci100050t.
  • (29).Cheng X; Lee RJ The Role of Helper Lipids in Lipid Nanoparticles (LNPs) Designed for Oligonucleotide Delivery. Advanced Drug Delivery Reviews 2016, 99, 129–137. 10.1016/j.addr.2016.01.022.
  • (30).Mason RD; Lind DA; Marchal WG Statistics: An Introduction, 5th ed.; Duxbury Press.
  • (31).Carozza G; Tisi A; Capozzo A; Cinque B; Giovannelli A; Feligioni M; Flati V; Maccarone R New Insights into Dose-Dependent Effects of Curcumin on ARPE-19 Cells. IJMS 2022, 23 (23), 14771. 10.3390/ijms232314771.
  • (32).Mannino G; Cristaldi M; Giurdanella G; Perrotta RE; Lo Furno D; Giuffrida R; Rusciano D ARPE-19 Conditioned Medium Promotes Neural Differentiation of Adipose-Derived Mesenchymal Stem Cells. WJSC 2021, 13 (11), 1783–1796. 10.4252/wjsc.v13.i11.1783.
  • (33).Bannigan P; Bao Z; Hickman RJ; Aldeghi M; Häse F; Aspuru-Guzik A; Allen C Machine Learning Models to Accelerate the Design of Polymeric Long-Acting Injectables. Nat Commun 2023, 14 (1), 35. 10.1038/s41467-022-35343-w.
  • (34).Kotu V; Deshpande B Classification. In Data Science; Elsevier, 2019; pp 65–163. 10.1016/B978-0-12-814761-0.00004-6.
  • (35).Nathanael K; Cheng S; Kovalchuk NM; Arcucci R; Simmons MJH Optimization of Microfluidic Synthesis of Silver Nanoparticles: A Generic Approach Using Machine Learning. Chemical Engineering Research and Design 2023, 193, 65–74. 10.1016/j.cherd.2023.03.007.
  • (36).Yakoubi S; Kobayashi I; Uemura K; Nakajima M; Hiroko I; Neves MA Recent Advances in Delivery Systems Optimization Using Machine Learning Approaches. Chemical Engineering and Processing - Process Intensification 2023, 188, 109352. 10.1016/j.cep.2023.109352.
  • (37).Yu F; Wei C; Deng P; Peng T; Hu X Deep Exploration of Random Forest Model Boosts the Interpretability of Machine Learning Studies of Complicated Immune Responses and Lung Burden of Nanoparticles. Science Advances 2021, 7 (22), eabf4130. 10.1126/sciadv.abf4130.
  • (38).Batra R; Loeffler TD; Chan H; Srinivasan S; Cui H; Korendovych IV; Nanda V; Palmer LC; Solomon LA; Fry HC; Sankaranarayanan SKRS Machine Learning Overcomes Human Bias in the Discovery of Self-Assembling Peptides. Nat. Chem 2022, 14 (12), 1427–1435. 10.1038/s41557-022-01055-3.
  • (39).Loecher A; Bruyns-Haylett M; Ballester PJ; Borros S; Oliva N A Machine Learning Approach to Predict Cellular Uptake of pBAE Polyplexes. Biomater. Sci 2023, 11 (17), 5797–5808. 10.1039/D3BM00741C.
  • (40).Yamankurt G; Berns EJ; Xue A; Lee A; Bagheri N; Mrksich M; Mirkin CA Exploration of the Nanomedicine-Design Space with High-Throughput Screening and Machine Learning. Nat Biomed Eng 2019, 3 (4), 318–327. 10.1038/s41551-019-0351-1.
  • (41).Metwally AA; Nayel AA; Hathout RM In Silico Prediction of siRNA Ionizable-Lipid Nanoparticles In Vivo Efficacy: Machine Learning Modeling Based on Formulation and Molecular Descriptors. Front. Mol. Biosci 2022, 9, 1042720. 10.3389/fmolb.2022.1042720.
  • (42).Sousa De Almeida M; Susnik E; Drasler B; Taladriz-Blanco P; Petri-Fink A; Rothen-Rutishauser B Understanding Nanoparticle Endocytosis to Improve Targeting Strategies in Nanomedicine. Chem. Soc. Rev 2021, 50 (9), 5397–5434. 10.1039/D0CS01127D.
  • (43).Rogers D; Hahn M Extended-Connectivity Fingerprints. J. Chem. Inf. Model 2010, 50 (5), 742–754. 10.1021/ci100050t.
  • (44).Cheng Q; Wei T; Farbiak L; Johnson LT; Dilliard SA; Siegwart DJ Selective Organ Targeting (SORT) Nanoparticles for Tissue-Specific mRNA Delivery and CRISPR–Cas Gene Editing. Nat. Nanotechnol 2020, 15 (4), 313–320. 10.1038/s41565-020-0669-6.
  • (45).Dahlman JE; Kauffman KJ; Xing Y; Shaw TE; Mir FF; Dlott CC; Langer R; Anderson DG; Wang ET Barcoded Nanoparticles for High Throughput in Vivo Discovery of Targeted Therapeutics. Proceedings of the National Academy of Sciences 2017, 114 (8), 2060–2065. 10.1073/pnas.1620874114.
  • (46).Aspuru-Guzik A Long-Acting-Injectables, 2022. https://github.com/aspuru-guzik-group/long-acting-injectables (accessed 2023-12-04).

Supplementary Materials

Supplementary_Information
