Author manuscript; available in PMC: 2021 Sep 1.
Published in final edited form as: Biochim Biophys Acta Biomembr. 2020 May 11;1862(9):183350. doi: 10.1016/j.bbamem.2020.183350

A Machine Learning Approach to Estimation of Phase Diagrams for Three-Component Lipid Mixtures

Mohammadreza Aghaaminiha a, Sara Akbar Ghanadian b, Ehsan Ahmadi c, Amir M Farnoud a
PMCID: PMC7301216  NIHMSID: NIHMS1596971  PMID: 32407774

Abstract

The plasma membrane of eukaryotic cells is commonly believed to contain ordered lipid domains. The interest in understanding the origin of such domains has led to extensive studies on the phase behavior of mixed lipid systems. Three-component phase diagrams, composed of a high melting temperature (Tm) lipid, cholesterol, and a low Tm lipid, have been valuable in studying lipid phase behavior. However, developing phase diagrams over the entire composition space and with precise tie-lines requires significant experimental effort. In this study, a machine learning approach was used to predict the Tm of lipids and generate phase diagrams for lipid mixtures. First, an artificial neural network (ANN) was used for the prediction of Tm. The network was trained using available Tm data and was able to generate Tm values that closely matched literature results for its testing dataset. This model was then used to predict the Tm for lipids that have not yet been experimentally tested. Then, random forests (RF) and support vector machines (SVM) were trained and tested for their ability to predict a test three-component phase diagram. The model from the RF algorithm was able to generate a diagram that closely matched published results. This model was then used to generate phase diagrams for lipid mixtures at various temperatures and various degrees of unsaturation. This machine learning approach to the generation of lipid phase diagrams has the potential to save significant time and resources in studies of lipid phase behavior.

Keywords: Lipids, melting temperature, phase diagram, machine learning, membranes

1. INTRODUCTION

Over the past few decades, significant evidence has accumulated regarding the presence of lateral heterogeneities, commonly known as lipid domains or rafts, in the plasma membrane of eukaryotic cells. These phase separations are the result of the preferential aggregation of cholesterol and saturated, long-chain lipids [1–5]. Ordered lipid domains have been postulated to play an important role in a variety of cell functions including signal transduction [3], endocytosis [6], and cell division [7], among others. Given the importance of lipid domains in various biological phenomena, significant effort has been focused on fundamental studies to examine lipid phase segregation [8–11]. However, due to the complexity of the plasma membrane, such studies are generally performed in well-defined model systems to provide information regarding the role of lipid chemical structure and molar ratio in the lateral heterogeneity of the membrane.

Phase diagrams have emerged as an important tool in characterizing lipid phase behavior in model membranes. First proposed by Feigenson and Buboltz [8], such diagrams are generally developed for systems composed of three lipids: one lipid with a high melting temperature, a sterol (generally cholesterol), and another lipid with a low melting temperature [1,4]. The melting temperature (Tm) itself is highly dependent on the lipid chemical structure, such as chain length, headgroup composition, and number and type of unsaturation [4,12]. Examination of changes in the phase behavior of three-component lipid mixtures has led to the development of diagrams, in which various lipid phases, including liquid-disordered (Ld), liquid-ordered (Lo), and gel (Lβ) phase, as well as their coexistence, can be observed as a function of the composition of lipids, their molar ratio, and the temperature. Such phase diagrams have been instrumental in allowing researchers to draw generalizable conclusions on the role of each lipid component in lipid phase behavior.

While three-component phase diagrams are valuable in understanding and interpreting lipid phase behavior, they are difficult to develop experimentally. The development of phase diagrams generally requires more than one experimental technique, and a combination of fluorescence spectroscopy, confocal microscopy, and sometimes X-ray based techniques is used [9–11]. In addition, determining the exact position of the tie-lines requires that a large number of samples be tested with significant precision to cover the entire composition space. For example, it has been estimated that approximately 400 samples of different compositions are needed to achieve five mole percent resolution for a three-component phase diagram [9]. More importantly, a simple change in the lipid composition of one of the components or changes in temperature would require repeating the experiments to redevelop the phase diagram for the new conditions. New techniques that could develop such phase diagrams without the need for expensive and time-consuming experimental procedures would significantly enhance the current understanding of lipid phase behavior.

Machine learning is a technique for mining datasets and using patterns and inference to develop new information. Supervised machine learning uses example input-output pairs to train an algorithm. In this case, the dataset is usually split into “training”, “cross-validation”, and “testing” categories. The training dataset is used to train a predictive model, the cross-validation dataset helps improve the model during training, and the testing dataset is used to evaluate the performance of the constructed predictive model. Machine learning has already found applications in the characterization of lipid-protein interactions in biological membranes [13,14]. The application of machine learning in the analysis of membrane phase separation, however, is only starting to gain attention, with two very recent reports focusing on the use of machine learning to analyze lipid phase separations based on results from molecular dynamics simulations [15] and to examine the role of lipid domains in the recruitment of proteins to transmembrane receptors [16]. Despite the importance of phase diagrams in understanding the biophysical behavior of membrane lipids, and the availability of experimental information that can be used to train algorithms, the use of machine learning to develop phase diagrams has remained unexplored.

In the current study, we applied supervised machine learning to predict the melting temperature of phospholipids depending on their chemical structure and to develop three-component lipid phase diagrams for various lipid mixtures. First, an artificial neural network (ANN) was used for the prediction of phospholipid melting temperature. Next, random forests (RF) and support vector machines (SVM) were used to develop a three-component phase diagram for lipid mixtures containing cholesterol, a low Tm phospholipid, and a high Tm phospholipid. Results show, for the first time, that it is possible to utilize machine learning techniques to determine phospholipid melting temperatures that closely match values reported in the literature. Results also reveal that it is possible to use machine learning to develop three-component phase diagrams for various lipid mixtures, or the same mixture at different temperatures, without the need for experimentation. This work can pave the way for the application of machine learning techniques in the evaluation of lipid phase behavior, thereby helping save significant time and resources.

2. MATERIALS AND METHODS

2.1. Description of machine learning methods

Three machine learning algorithms were used in this research: ANN, implemented in the NeuroSolutions 7.1.1 package, and RF and SVM, implemented in the R software package. In brief, ANN is a method that imitates the nonlinear learning process in the networks of neurons in the brain. In other words, an ANN is a set of interconnected nodes, called neurons, which are organized in different layers and connected through edges whose weights are continuously adjusted throughout the learning process [17,18]. RF is a machine learning method that works by constructing a combination of decision trees for classification [19]. The decision trees are independently constructed using two main hyper-parameters: the number of trees (NT) and the number of variables (NV) [20]. These two hyper-parameters indicate how many decision trees are built and how many variables are randomly sampled for building them. In the end, the constructed decision trees vote for the most popular class to finalize classification [21]. Voting here means that the phase reported for each data point is the phase most frequently predicted by all trees for that data point. SVM is a statistical learning method that has been applied to various classification and forecasting problems [22]. It constructs a set of hyperplanes, which separate the data into different classes by boundaries. To define these boundaries, a subset of the training set, called support vectors, is identified. Then, the algorithm looks for the optimal separating hyperplane between these classes by maximizing the margin, that is, the distance between the hyperplane and the nearest data points [23]. SVM works with four different classifier functions: linear, polynomial, radial basis, and sigmoid [24]. Two important hyper-parameters of SVM are Cost and Gamma. Cost indicates the cost of constraint violation and controls the trade-off between the margin and classification errors, while Gamma is a coefficient of the non-linear classifier functions (polynomial, radial basis, and sigmoid) [24]. As the values of the above-mentioned hyper-parameters significantly impact the performance of each algorithm, methods of tuning hyper-parameters of ANN, RF, and SVM are described further in section 2.3.
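
The voting step of RF described above can be illustrated with a minimal sketch. The per-tree predictions are made-up values for a single composition point, and `majority_vote` is a hypothetical helper, not a function of the R package used in the study.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    counts = Counter(tree_predictions)
    # most_common(1) returns [(label, count)] for the most frequent label
    return counts.most_common(1)[0][0]

# Hypothetical phase codes predicted by seven trees for one data point
# (1 = Ld, 2 = Lo, 4 = Ld+Lo)
votes = [1, 4, 4, 1, 4, 4, 2]
predicted_phase = majority_vote(votes)
print(predicted_phase)  # 4
```

In this toy case four of the seven trees predict phase 4, so that phase is reported for the data point.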

2.2. Datasets

All three machine learning methods require training datasets. The training dataset for the ANN included the melting temperature of lipids reported in the literature, while the training dataset for the RF and SVM methods included the reported phase diagrams for known lipid mixtures, also acquired from literature reports.

Melting temperature dataset.

The melting temperature dataset included 65 samples, all retrieved from the Lipid Thermotropic Phase Transition Database (LIPIDAT) – NIST Standard Reference Database 34 [25,26]. Each data point was described by eleven features (or input variables), with Tm, the melting temperature of the lipid, as the response variable (Table S1). Of the available Tm dataset, 70% of the data was used as the training set, 15% as cross-validation, and the rest (15%) as a testing set. These ratios were selected based on common practices in the machine learning literature. Dataset features included: T1L (tail 1 length, varied between 3 and 24), T2L (tail 2 length, varied between 3 and 24), MW (molecular weight in g/mole), BT (backbone type, glycerol = 1, sphingosine = 2), HS (head size, choline = 104 g/mole, ethanolamine = 61 g/mole, glycerol = 92 g/mole, serine = 105 g/mole, hydroxyl = 17 g/mole), ChT (acyl chain type, not mixed acyl chain = 0, mixed acyl chain = 1), ST (saturation type, saturated = 0, unsaturated-cis = 1, unsaturated-trans = 2), NU (number of unsaturated carbons, varied between 0 and 12), HCh (head charge, zwitterionic = 0, anionic = −1), ID1 (first unsaturated carbon ID number on tail 1), and ID2 (first unsaturated carbon ID number on tail 2). Taking 16:0–22:6 phosphatidylcholine as an example, the features were as follows: T1L = 16, T2L = 22, MW = 806, BT = 1, HS = 104, ChT = 1, ST = 1, NU = 6, HCh = 0, ID1 = 0, ID2 = 4, and Tm = −27.

Phase diagram dataset.

The phase diagram dataset was developed based on the three-component phase diagrams listed in Figure 1 (see Figure S1 for the detailed data points). The first seven diagrams (Figure 1A to Figure 1G) were used as the training and cross-validation dataset. The last diagram (H) formed the testing dataset. Each diagram was redrawn from the literature as a combination of 5,151 data points (see Figure S2). Therefore, the training dataset contained 7 × 5,151 = 36,057 samples. An abbreviated and not randomized version of the constructed training dataset is reported in Table S2. Each sample in the dataset was described by fifteen features and a response variable (class) as follows (note that Lipid 1 refers to the low Tm lipid and Lipid 2 refers to the high Tm lipid): T1L1 (lipid1 tail1 length), T2L1 (lipid1 tail2 length), ST1 (lipid 1 saturation type), NU1 (lipid 1 number of unsaturated carbons), MW1 (lipid1 molecular weight), TM1 (lipid1 melting temperature), MF1 (lipid1 mole fraction), T1L2 (lipid2 tail1 length), T2L2 (lipid2 tail2 length), ST2 (lipid 2 saturation type), NU2 (lipid 2 number of unsaturated carbons), MW2 (lipid2 molecular weight), TM2 (lipid2 melting temperature), MF2 (lipid2 mole fraction), and T (temperature of the system). The response variable was the phase of the mixture and it is defined in Table 1.
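
The figure of 5,151 points per diagram is consistent with sampling the composition triangle at 1 mol% increments, although the grid spacing is not stated explicitly here. A minimal sketch under that assumption (the function name `ternary_grid` is illustrative only):

```python
def ternary_grid(steps=100):
    """Enumerate (lipid 1, lipid 2, cholesterol) mole fractions on a regular grid.

    `steps` is the number of increments along each edge; 100 corresponds to
    1 mol% spacing (an assumption, not a value quoted in the text).
    """
    points = []
    for i in range(steps + 1):            # lipid 1, in units of 1/steps
        for j in range(steps + 1 - i):    # lipid 2
            k = steps - i - j             # cholesterol takes the remainder
            points.append((i / steps, j / steps, k / steps))
    return points

grid = ternary_grid(100)
print(len(grid))        # 5151 points per diagram
print(7 * len(grid))    # 36057 samples from seven diagrams
```

Each tuple would then be combined with the lipid descriptors and the temperature to form one row of the dataset.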

Figure 1.

Ternary phase diagrams of different phospholipid/cholesterol mixtures. Number codes: 1 = Ld, 2 = Lo, 3 = Lβ, 4 = Ld+Lo, 5 = Ld+Lβ, 6 = Lo+Lβ, 7 = Ld+Lo+Lβ, and 8 = Crystals+Lo. A) DLPC/DPPC/CHOL at 24 °C from [8]; B) POPC/DSPC/CHOL at 23 °C from [9]; C) DOPC/DSPC/CHOL at 23 °C from [9]; D) DOPC/POPC/CHOL at 23 °C from [9]; E) DOPC/DPPC/CHOL at 28 °C from [27]; F) DOPC/DPPC/CHOL at 22 °C from [27]; G) DOPC/DPPC/CHOL at 18 °C from [27]; and H) SDPC/BSM/CHOL at 23 °C from [11]. The phase diagram in H was used as the testing set. Lipid abbreviations: DLPC: 1,2-dilauroyl-sn-glycero-3-phosphocholine, DPPC: 1,2-dipalmitoyl-sn-glycero-3-phosphocholine, POPC: 1-palmitoyl-2-oleoyl-glycero-3-phosphocholine, DSPC: 1,2-distearoyl-sn-glycero-3-phosphocholine, DOPC: 1,2-dioleoyl-sn-glycero-3-phosphocholine, SDPC: 1-stearoyl-2-docosahexaenoyl-sn-glycero-3-phosphocholine, BSM: Sphingomyelin (Brain).

Table 1.

Response variable (phase of the mixture) definition

| Phase # | Abbreviation | Definition |
|---|---|---|
| 1 | Ld | Liquid-disordered phase |
| 2 | Lo | Liquid-ordered phase |
| 3 | Lβ | Solid-ordered phase (gel) |
| 4 | Ld + Lo | Coexistence of Ld and Lo phases |
| 5 | Ld + Lβ | Coexistence of Ld and Lβ phases |
| 6 | Lo + Lβ | Coexistence of Lo and Lβ phases |
| 7 | Ld + Lo + Lβ | Coexistence of Ld, Lo, and Lβ phases |
| 8 | Crystals + Lo | Cholesterol monohydrate crystals in equilibrium with a cholesterol-saturated Lo phase [1,8] |

Data preprocessing.

To efficiently utilize machine learning for extracting information from the dataset, a series of preprocessing steps including data balancing and data transformation were performed. Specifically, in the phase diagram dataset, the class labels of the response variable are not evenly distributed (i.e. there are more points in one phase compared to others), which could cause difficulties for the learning algorithms (see Table 2). To overcome this class imbalance, the well-known synthetic minority over-sampling technique (SMOTE) was used [28]. The general idea behind SMOTE is to artificially generate new samples of the under-represented classes using their nearest neighbors. This procedure is called over-sampling and is controlled by two parameters, OS and NN. OS controls the size of over-sampling, and NN controls the number of neighbors considered for generating new samples. SMOTE also uses an under-sampling mechanism for the majority classes. The size of the under-sampling is controlled by the US parameter. Overall, the use of SMOTE leads to a dataset that is more balanced across all classes.

Table 2.

Percentages of the class labels of the response variable (phase of the mixture).

| Phase | Ld | Lo | Lβ | Ld+Lo | Ld+Lβ | Lo+Lβ | Ld+Lo+Lβ | Crystals+Lo |
|---|---|---|---|---|---|---|---|---|
| Percentage (%) | 19.82 | 36.14 | 1.72 | 9.87 | 11.28 | 2.44 | 6.50 | 12.23 |
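
The over-sampling idea behind SMOTE can be sketched as follows. This is a minimal illustration, not the implementation used in the study: `smote_oversample`, its arguments, and the toy data are hypothetical, the under-sampling step (US) is not shown, and how OS maps onto the number of new samples depends on the implementation.

```python
import numpy as np

def smote_oversample(X_min, n_new, n_neighbors=5, seed=0):
    """Create n_new synthetic minority samples by interpolating between a
    randomly chosen minority sample and one of its nearest minority neighbors."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # Pairwise Euclidean distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)                       # a point is not its own neighbor
    k = min(n_neighbors, len(X_min) - 1)
    neighbors = np.argsort(d, axis=1)[:, :k]          # indices of the k nearest neighbors
    base = rng.integers(0, len(X_min), size=n_new)    # samples to start from
    pick = neighbors[base, rng.integers(0, k, size=n_new)]  # neighbors to move toward
    gap = rng.random((n_new, 1))                      # interpolation fraction in [0, 1)
    return X_min[base] + gap * (X_min[pick] - X_min[base])

# Toy minority class with two normalized features (made-up values)
X_min = np.array([[0.10, 0.20], [0.12, 0.25], [0.15, 0.22], [0.11, 0.28]])
synthetic = smote_oversample(X_min, n_new=8, n_neighbors=2)
print(synthetic.shape)  # (8, 2)
```

Because every synthetic point lies on a segment between two existing minority samples, the new points stay inside the region already occupied by that class.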

Data transformation.

In the melting temperature dataset, four features, BT (backbone type), ChT (acyl chain type), ST (saturation type), and HCh (head charge), were categorical and were translated into dummy features for use in the ANN model. In the phase diagram dataset, ST1 (lipid 1 saturation type) and ST2 (lipid 2 saturation type) were categorical and were translated into dummy features, while the rest of the features were treated as numerical. In both datasets, Max-Min normalization was performed to scale the feature values to the range of zero to one.
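
As a small worked example of these two transformations, the sketch below scales and encodes a few features of the 16:0–22:6 phosphatidylcholine example from Section 2.2. The helper names are hypothetical, the minimum and maximum values are the feature ranges quoted in Section 2.2 (in practice they would come from the training data), and the use of one dummy column per category is an assumption.

```python
def min_max(value, lo, hi):
    """Scale a value to [0, 1] given the feature's minimum and maximum."""
    return (value - lo) / (hi - lo)

def one_hot(value, categories):
    """Translate a categorical value into dummy (0/1) features."""
    return [1 if value == c else 0 for c in categories]

# Selected features of 16:0-22:6 phosphatidylcholine (Section 2.2)
lipid = {"T1L": 16, "T2L": 22, "NU": 6, "ST": 1}

scaled = {
    "T1L": min_max(lipid["T1L"], 3, 24),   # tail lengths vary between 3 and 24
    "T2L": min_max(lipid["T2L"], 3, 24),
    "NU": min_max(lipid["NU"], 0, 12),     # unsaturated carbons vary between 0 and 12
}
# ST codes: saturated = 0, unsaturated-cis = 1, unsaturated-trans = 2
st_dummies = one_hot(lipid["ST"], [0, 1, 2])

print({k: round(v, 3) for k, v in scaled.items()})  # {'T1L': 0.619, 'T2L': 0.905, 'NU': 0.5}
print(st_dummies)                                   # [0, 1, 0]
```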

2.3. Design of elements

Most of the machine learning approaches require the setting of hyper-parameters before the training is initiated. The hyper-parameters determine the structure and size of the machine learning models. Here, an analytical technique was used to fine-tune the hyper-parameters to obtain the best outputs for each technique, as outlined below.

Hyper-parameter tuning for ANN.

The grid search technique was used to tune the hyper-parameters of ANN. This technique builds a model for each combination of the hyper-parameter values and selects the combination that results in the best performance. For this purpose, a multi-layer perceptron neural network model with two or three hidden layers (HLs) was examined. The transfer function for the neurons was set to be identity, logistic sigmoid, or hyperbolic tangent. For the learning process, momentum, conjugate gradient, and Levenberg-Marquardt algorithms were investigated as suggested elsewhere [29]. The number of neurons in each HL was varied from 1 to 9 in increments of 1. After comparing all possible combinations of the hyper-parameter values, the combination that yielded the minimum root mean square error (RMSE), defined in Equation 1, was: two HLs, four neurons in HL1, two neurons in HL2, hyperbolic tangent transfer function, and momentum as the learning algorithm. This combination was selected for building the predictive model.

Root mean square error:  $$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} \left(Tm_{\mathrm{output}} - Tm_{\mathrm{desired}}\right)^{2}}{n}}$$ (1)
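
The search space described for the ANN and the selection criterion of Equation 1 can be sketched as below. The sketch only enumerates candidate configurations and defines the error measure; it does not train any network. It assumes that the size of each hidden layer is varied independently, which is not stated explicitly, so the resulting count is illustrative rather than a figure from the study.

```python
import math
from itertools import product

transfer_functions = ["identity", "logistic sigmoid", "hyperbolic tangent"]
learning_rules = ["momentum", "conjugate gradient", "Levenberg-Marquardt"]
neuron_counts = range(1, 10)   # 1 to 9 neurons per hidden layer

def candidate_configs():
    """Yield every (layer sizes, transfer function, learning rule) combination."""
    for n_layers in (2, 3):
        # Assumption: each hidden layer's size is varied independently
        for sizes in product(neuron_counts, repeat=n_layers):
            for tf, rule in product(transfer_functions, learning_rules):
                yield sizes, tf, rule

def rmse(outputs, desired):
    """Root mean square error between predicted and desired Tm values (Equation 1)."""
    n = len(desired)
    return math.sqrt(sum((o - d) ** 2 for o, d in zip(outputs, desired)) / n)

configs = list(candidate_configs())
print(len(configs))  # 7290 under the independence assumption

# The configuration reported as best in the text is one point of this grid
selected = ((4, 2), "hyperbolic tangent", "momentum")
print(selected in configs)  # True
```

In a full grid search, each configuration would be trained and the one with the lowest cross-validation `rmse` kept.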

Hyper-parameter tuning for RF and SVM.

The performance of the classification models is highly dependent on the value of their hyper-parameters. Here, the hyper-parameters of RF (NT and NV) and SVM (Cost and Gamma), along with three hyper-parameters of SMOTE (OS, US, and NN), needed to be tuned, giving five hyper-parameters for each algorithm. Five levels were considered for each hyper-parameter, resulting in 5^5 = 3,125 combinations for a full factorial design. Because of the random nature involved in these algorithms, multiple executions for each experiment were required to achieve robustness in the results. Assuming 20 runs for each experiment, this results in a total of 3,125 × 20 = 62,500 executions, which is computationally expensive. Thus, the Taguchi method [30] was applied to reduce the number of experiments and determine the proper value for each parameter. Given the five factors, each with five levels, an L25 orthogonal array was designed (Table S3). The value of each parameter corresponding to each factor level is reported in Table 3. The guideline recommended by Hsu and colleagues [31] was used for selecting the range of values for the Cost and Gamma hyper-parameters of SVM. According to this guideline, examining exponentially growing values of Cost and Gamma generates acceptable results.

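Table 3.

The value of hyper-parameters corresponding to each level in RF and SVM techniques.

| Level | RF: OS | RF: US | RF: NN | RF: NV | RF: NT | SVM: OS | SVM: US | SVM: NN | SVM: Cost | SVM: Gamma |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 100 | 100 | 1 | 3 | 100 | 100 | 100 | 1 | 2^−3 | 2^−9 |
| 2 | 200 | 200 | 5 | 4 | 250 | 200 | 200 | 5 | 2^−1 | 2^−7 |
| 3 | 300 | 300 | 10 | 5 | 500 | 300 | 300 | 10 | 2^1 | 2^−5 |
| 4 | 400 | 400 | 15 | 6 | 750 | 400 | 400 | 15 | 2^3 | 2^−3 |
| 5 | 500 | 500 | 20 | 7 | 1000 | 500 | 500 | 20 | 2^5 | 2^−1 |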
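
To illustrate how 25 runs can stand in for the 3,125 runs of a full factorial design, the sketch below builds a five-level, 25-run orthogonal array with a standard modular-arithmetic construction. This is an illustration only: the row order and column assignment of the array actually used (Table S3) may differ, and the function names are hypothetical.

```python
from itertools import combinations

def l25_array(n_factors=5):
    """Build a 25-run, 5-level orthogonal array using arithmetic modulo 5.

    Columns are a, b, a+b, a+2b, a+3b, a+4b (mod 5); any n_factors <= 6 of
    them form a strength-2 orthogonal array. Levels are reported as 1..5.
    """
    runs = []
    for a in range(5):
        for b in range(5):
            row = [a, b] + [(a + m * b) % 5 for m in range(1, 5)]
            runs.append([level + 1 for level in row[:n_factors]])
    return runs

def is_orthogonal(runs):
    """Check that every pair of columns contains each level pair exactly once."""
    n_cols = len(runs[0])
    for c1, c2 in combinations(range(n_cols), 2):
        pairs = {(r[c1], r[c2]) for r in runs}
        if len(pairs) != 25:   # 25 runs, so 25 distinct pairs means one of each
            return False
    return True

design = l25_array(5)
print(len(design), 5 ** 5)     # 25 3125
print(is_orthogonal(design))   # True
```

Each row gives the level (1 to 5) of the five factors for one experiment; the levels map to hyper-parameter values through Table 3.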

The Taguchi method determines an acceptable value for each factor level by maximizing a signal-to-noise (S/N) ratio. Signal represents the mean of the response variable (objective function) and noise denotes its standard deviation. The objective function can be one of three types: the larger-the-better, the smaller-the-better, and nominal-is-best [32]. In this study, the larger-the-better response variable was applied, as it was desirable to maximize the overall accuracy of the classification models. The S/N ratio is calculated as follows:

Signal-to-noise ratio:  $$S/N = -10 \times \log_{10}\left(\frac{1}{e} \times \sum_{i=1}^{e} \frac{1}{ACC^{2}}\right)$$ (2)

Here, e is the number of experiments and ACC is the overall accuracy of the classification model. The ACC is calculated by adopting a k-fold cross-validation mechanism. For this purpose, the training dataset is partitioned into k subsets of equal size. Then, given the parameter values corresponding to each experiment, k-1 partitions are used for training, and one partition is used for validating the trained models. In this study, a 7-fold cross-validation method was utilized. To do so, six out of the seven phase diagrams (Figure 1A to Figure 1G) were considered as the training set while one phase diagram was considered as the validating dataset. This process was repeated seven times in such a way that each phase diagram served exactly once as the validating dataset and was part of the training set in the other six repetitions. Finally, the averaged ACC was obtained and used in the S/N calculation. The calculated S/N ratios for both RF and SVM methods are shown in Figure S3. The factor level that yielded the highest S/N ratio for each hyper-parameter was selected as the optimal value. As depicted in Figure S3, the best factor levels for OS, US, NN, NV, and NT as the hyper-parameters of SMOTE-RF are 5, 4, 3, 5, and 5, respectively. Figure S3 also illustrates that in the SMOTE-SVM method, the optimal factor levels for OS, US, NN, Cost, and Gamma are 4, 3, 2, 5, and 4, respectively. The optimal values corresponding to each factor are also reported in Table 4.

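Table 4.

The optimal values for the factors in the RF and SVM algorithms.

| Algorithm | OS | US | NN | NV | NT | Cost | Gamma |
|---|---|---|---|---|---|---|---|
| RF | 500 | 400 | 10 | 7 | 1000 | n/a | n/a |
| SVM | 400 | 300 | 5 | n/a | n/a | 2^5 | 2^−3 |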
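
The cross-validation and signal-to-noise steps of this section can be sketched as follows. The sketch assumes the standard larger-the-better form, S/N = −10·log10(mean(1/ACC²)), so that higher and more consistent accuracies give a higher ratio. The function names and the accuracy values are hypothetical and serve only to show the calculation.

```python
import math

diagrams = ["A", "B", "C", "D", "E", "F", "G"]   # training diagrams in Figure 1

def leave_one_diagram_out(diagram_ids):
    """Yield (training diagrams, validation diagram) for each of the folds."""
    for held_out in diagram_ids:
        training = [d for d in diagram_ids if d != held_out]
        yield training, held_out

def mean_accuracy(fold_accuracies):
    """Average the per-fold accuracies into the ACC of one run."""
    return sum(fold_accuracies) / len(fold_accuracies)

def sn_larger_the_better(acc_values):
    """Taguchi larger-the-better S/N ratio for accuracies in (0, 1]."""
    e = len(acc_values)
    mean_inverse_square = sum(1.0 / acc ** 2 for acc in acc_values) / e
    return -10.0 * math.log10(mean_inverse_square)

folds = list(leave_one_diagram_out(diagrams))
for training, held_out in folds:
    print("train on", "".join(training), "| validate on", held_out)

# Made-up ACC values for two hypothetical hyper-parameter settings
steady = sn_larger_the_better([0.80, 0.82, 0.81])
erratic = sn_larger_the_better([0.60, 0.95, 0.70])
print(steady > erratic)  # True
```

Holding out a whole diagram, rather than random points, means every validation fold tests the model on a lipid mixture or temperature it has not seen during training.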

2.4. Evaluation Metrics

In the context of classification problems with multiple classes, the overall accuracy of the model can be measured either by the micro-average accuracy or the macro-average accuracy. The macro-average treats all classes equally, as opposed to the micro-average, which treats all samples equally. Given the unbalanced phase diagram dataset (see Table 2), the performance of the phase diagram classification models is better reflected by the micro-average criteria. The precision and recall were also reported for each class (Ci, where i ∈ {1,2, …, 8}). These performance metrics are presented below according to the report of Sokolova and colleagues [33]:

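Recall:  $$\mathrm{Rec}_{C_i} = \frac{TP_{C_i}}{TP_{C_i} + FN_{C_i}}$$ (3)
Precision:  $$\mathrm{Pre}_{C_i} = \frac{TP_{C_i}}{TP_{C_i} + FP_{C_i}}$$ (4)
Overall Accuracy:  $$ACC = \frac{\sum_{i=1}^{N} TP_{C_i}}{\sum_{i=1}^{N} \left(TP_{C_i} + FN_{C_i}\right)}$$ (5)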

In the above equations, $TP_{C_i}$ (true positive) indicates the number of correctly classified instances of class Ci. When errors occur, $FN_{C_i}$ (false negative) is the number of instances that belong to class Ci but are predicted as other classes, and $FP_{C_i}$ (false positive) is the number of instances that do not belong to class Ci but are labeled as class Ci. N is the number of classes.

3. RESULTS AND DISCUSSION

3.1. Prediction of phospholipid melting temperature

The ANN algorithm was used to predict lipid melting temperature based on the chemical structure of lipids. Various ANN models were evaluated for this purpose and an optimized model was selected, based on minimized root mean square error (RMSE) and maximized accuracy. This model resulted in an RMSE value of 0.06 and an accuracy of 95.42%. Figure 2A shows the performance of the selected ANN model in predicting the melting temperature of the 15% of phospholipids randomly selected from the Tm dataset as the testing set (selected lipids are marked with * in Table S1). From Figure 2A, it can be observed that the predicted values are close to those reported in the literature. This is further confirmed by Figure 2B, in which the predicted Tm values are plotted against the reported values. An R² value of 0.96 confirms the linearity of the plot and the accuracy of the ANN model.

Figure 2.

Accuracy of the selected ANN model. A) Tm desired (blue solid line) compared to Tm predicted (red dashed line) for each tested sample. B) Tm predicted vs. Tm desired for each tested sample. The red line in A represents the predicted values while the blue line displays the desired values. For example, for sample seven, the desired value is 0.72 (i.e. Tm = 40.0 °C) while the predicted value is 0.73 (i.e. Tm = 42.2 °C). In both figures, the y-axis on the left is the normalized melting temperature, while on the right is the actual melting temperature. All Tm values were normalized based on the training dataset at the time of building the model.

Since the selected model was able to efficiently predict the Tm of the tested samples, it was used to predict the Tm values for lipids for which experimental results are currently not available. To this aim, the lipid headgroup, chain length, and saturation level were altered, and, in each case, the Tm values were generated using the ANN model. Representative results are presented in Table 5, while the complete set of results can be found in Table S4.

Table 5.

A representative table of predicted Tm values for lipids not experimentally tested in the literature from the selected ANN model. A complete set of data can be found in Table S4.

| Lipid | Tm (°C) |
|---|---|
| 6:0 PG | −5.4 |
| 8:0 PG | −5.3 |
| 10:0 PG | −4.3 |
| 15:0 PG | 35.6 |
| 17:0 PG | 55.6 |
| 14:1 (Δ9-Cis) PC | −46.9 |
| 14:1 (Δ9-Trans) PC | 25.6 |
| 16:1 (Δ9-Trans) PC | 31.5 |
| 18:0 PA | 72.0 |
| 18:2 PA | −41.7 |
| 02:0 SM (d18:1/2:0) | −5.3 |
| 06:0 SM (d18:1/6:0) | −4.9 |
| 12:0 SM (d18:1/12:0) | 8.9 |

It can be observed from Table 5 that while the data generated by the model have not been experimentally confirmed, the trends comply with what is expected from the Tm of lipids based on their chemical structure. In all cases, increasing the number of carbons in the acyl chains led to an increase in Tm. For example, a Tm of −5.4 °C was predicted for 6:0 phosphatidylglycerol (PG), which increased consistently as the number of carbons increased, leading to a melting temperature of 55.6 °C for 17:0 PG. A similar trend is observed in the case of sphingomyelins. In addition, consistent with the literature, increased unsaturation reduced the melting temperature of lipids [12,34]. For example, while a Tm of 72.0 °C was predicted for 18:0 phosphatidic acid (PA), the addition of two cis double bonds in 18:2 PA reduced the melting temperature to −41.7 °C. Cis double bonds decreased the Tm more than trans double bonds, as observed in the case of 14:1 (Δ9-Cis) phosphatidylcholine (PC) and 14:1 (Δ9-Trans) PC, which is also in agreement with the literature [12,34]. Note that the Tm for PE lipids has not been predicted using the model, given the complex polymorphic phase behavior of these lipids and its dependence on water content [35,36].

3.2. Prediction of phase diagrams

RF and SVM algorithms, using the hyper-parameters presented in Table 4, were used to predict the ternary phase diagrams of phospholipids/cholesterol mixtures. Before predicting the phase diagrams for phospholipid/cholesterol mixtures, for which experimental data was not available, the models were tested for a mixture of SDPC/BSM/CHOL. The phase diagram for this system has previously been reported (Figure 1H) and was developed using the RF and SVM models to examine the accuracy of the diagrams predicted by each model. The phase diagrams generated by the models are presented along with the literature results in Figure 3 and can be found with more details in Figure S4.

Figure 3.

The phase diagram of SDPC/BSM/Chol mixture as A) reported by Konyakhina et al. [11] and predicted by B) the RF model and C) the SVM model. Number codes: 1 = Ld, 2 = Lo, 3 = Lβ, 4 = Ld+Lo, 5 = Ld+Lβ, 6 = Lo+Lβ, 7 = Ld+Lo+Lβ, and 8 = Crystals+Lo. Lipid abbreviations: SDPC: 1-stearoyl-2-docosahexaenoyl-sn-glycero-3-phosphocholine, BSM: Sphingomyelin (Brain).

As can be discerned from Figure 3, the RF algorithm generated a better prediction of the phase diagram for the tested lipid mixture compared to the SVM algorithm. This is likely due to the SVM algorithm being more sensitive to non-normal distribution of the data, while the RF model is less sensitive to outliers and multicollinearity among the predictor variables. Importantly, the phase diagram generated by the RF model was in close agreement with the literature results (Figure 3A and Figure 3B). While minor differences were observed regarding the size of the phases, the phase boundaries closely mimicked those determined experimentally by Konyakhina and colleagues [11].

The performance of the models was further assessed using a confusion matrix (Table 6). The confusion matrix is a tabular representation of the classification results. The columns of the confusion matrix stand for the actual (desired) phase labels and the rows stand for the predicted phase labels. The values reported in the main diagonal of the confusion matrix represent the number of correctly classified mixtures. For example, the RF model correctly predicted 537 data points to belong to the Ld phase, but incorrectly assigned 90 Ld points to the Lo phase, resulting in a recall of 0.86 for the Ld phase. As can be seen in the confusion matrix, considering recall and precision as the two performance metrics, the RF predictive model performs better than the SVM model for most classes. The SVM model showed better performance only for a few individual metrics, such as recall for the Ld and Lβ phases and precision for the Lo phase. However, the RF was the superior model when it came to overall accuracy.

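Table 6.

The confusion matrix generated for phase diagram prediction by the RF and SVM models. Columns give the desired (actual) phase and rows give the predicted phase. Number codes: 1 = Ld, 2 = Lo, 3 = Lβ, 4 = Ld+Lo, 5 = Ld+Lβ, 6 = Lo+Lβ, 7 = Ld+Lo+Lβ, and 8 = Crystals+Lo.

Random Forests (RF)

| Predicted ↓ / Actual → | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| 1 | 537 | 0 | 0 | 158 | 156 | 0 | 36 | 0 |
| 2 | 90 | 1584 | 0 | 63 | 0 | 47 | 19 | 6 |
| 3 | 0 | 0 | 85 | 0 | 10 | 0 | 0 | 0 |
| 4 | 0 | 2 | 0 | 498 | 0 | 0 | 140 | 0 |
| 5 | 0 | 0 | 16 | 0 | 658 | 0 | 38 | 0 |
| 6 | 0 | 0 | 4 | 0 | 0 | 58 | 4 | 0 |
| 7 | 0 | 0 | 0 | 0 | 9 | 10 | 299 | 0 |
| 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 624 |
| Recall | 0.86 | 1.00 | 0.81 | 0.69 | 0.79 | 0.50 | 0.56 | 0.99 |
| Precision | 0.61 | 0.88 | 0.89 | 0.78 | 0.92 | 0.88 | 0.94 | 1.00 |

Overall accuracy (RF): 0.84

Support Vector Machines (SVM)

| Predicted ↓ / Actual → | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| 1 | 605 | 44 | 0 | 203 | 115 | 0 | 38 | 0 |
| 2 | 0 | 1170 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 105 | 0 | 180 | 78 | 19 | 0 |
| 4 | 22 | 317 | 0 | 512 | 1 | 0 | 192 | 0 |
| 5 | 0 | 0 | 0 | 0 | 407 | 0 | 18 | 0 |
| 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | 0 | 15 | 0 | 4 | 130 | 37 | 269 | 0 |
| 8 | 0 | 40 | 0 | 0 | 0 | 0 | 0 | 630 |
| Recall | 0.96 | 0.74 | 1.00 | 0.71 | 0.49 | 0.00 | 0.50 | 1.00 |
| Precision | 0.60 | 1.00 | 0.27 | 0.49 | 0.96 | NaN | 0.59 | 0.94 |

Overall accuracy (SVM): 0.72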
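
The recall, precision, and overall accuracy values of the RF model can be recomputed from its confusion matrix using Equations 3 to 5. The sketch below uses NumPy, which is a choice made for illustration and not the software used in the study.

```python
import numpy as np

# RF confusion matrix from Table 6: rows = predicted phase, columns = actual phase
rf = np.array([
    [537,    0,  0, 158, 156,  0,  36,   0],
    [ 90, 1584,  0,  63,   0, 47,  19,   6],
    [  0,    0, 85,   0,  10,  0,   0,   0],
    [  0,    2,  0, 498,   0,  0, 140,   0],
    [  0,    0, 16,   0, 658,  0,  38,   0],
    [  0,    0,  4,   0,   0, 58,   4,   0],
    [  0,    0,  0,   0,   9, 10, 299,   0],
    [  0,    0,  0,   0,   0,  0,   0, 624],
])

tp = np.diag(rf)                  # correctly classified points per phase
recall = tp / rf.sum(axis=0)      # Equation 3: column totals are the actual counts
precision = tp / rf.sum(axis=1)   # Equation 4: row totals are the predicted counts
accuracy = tp.sum() / rf.sum()    # Equation 5 (micro-average)

print(np.round(recall, 2))
print(np.round(precision, 2))
print(round(float(accuracy), 2))  # 0.84
```

The matrix entries sum to 5,151, the number of data points in the test diagram, and the computed values agree with the Recall, Precision, and Overall Accuracy entries reported for RF in Table 6.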

Since the RF algorithm was able to accurately predict the ternary phase diagram for the SDPC/BSM/Chol mixture, it was used to generate phase diagrams for lipid mixtures for which such diagrams have not yet been developed. As a first approach, this algorithm was used to investigate the effect of temperature on (i) the phase behavior of a DOPC/DPPC/CHOL mixture, for which the phase diagrams at a range of temperatures have been reported [27], and (ii) the phase behavior of a POPC/PSM/CHOL mixture, for which the phase diagrams have been reported [37], but with fewer experimental points than the phase diagrams presented in Figure 1. The phase diagrams for both systems were predicted at a high (37 °C) and low temperature (15 °C), allowing for examination of the accuracy of the model, as well as its robustness with respect to temperature. These results are shown in Figure 4. Detailed data points for each phase can be found in Figure S5.

Figure 4.

The effect of temperature on the lipid phase diagram as generated by the RF model. The phase diagram for DOPC/DPPC/CHOL system at the temperature of A) 15 °C, and B) 37 °C. The phase diagram for the POPC/PSM/CHOL system at C) 23 °C, and D) 37 °C. Number codes: 1 = Ld, 2 = Lo, 3 = Lβ, 4 = Ld+Lo, 5 = Ld+Lβ, 6 = Lo+Lβ, 7 = Ld+Lo+Lβ, and 8 = Crystals+Lo. Lipid abbreviations: DOPC: 1,2-dioleoyl-sn-glycero-3-phosphocholine, DPPC: 1,2-dipalmitoyl-sn-glycero-3-phosphocholine, POPC: 1-palmitoyl-2-oleoyl-glycero-3-phosphocholine, PSM: N-palmitoyl-D-erythro-sphingosylphosphorylcholine.

Changing the temperature produced substantial differences in the phase diagrams generated by the RF algorithm. When the temperature was reduced (Figure 4A and 4C), the boundary of the Ld phase (region 1 in the diagram) shifted to the left. This was accompanied by a marked increase in the area of the Ld+Lβ, Lo+Lβ, and Ld+Lo+Lβ regions (regions 5, 6, and 7, respectively). These changes are expected: lowering the temperature reduces the kinetic energy of the lipids and thereby increases their order, as has been reported for other phase diagrams acquired experimentally at various temperatures [37,38].

Next, the effect of unsaturation in one vs. both acyl chains was examined. To this end, the RF algorithm was used to generate a phase diagram for the POPC/DPPC/Chol mixture at 22 °C, which was then compared with a DOPC/DPPC/Chol phase diagram from the literature [27] at the same temperature. Here, the presence of a lipid with only one unsaturated acyl chain (POPC) was expected to increase the area of the more ordered phases in the phase diagram, compared to a lipid with two unsaturated acyl chains (DOPC). This is indeed the trend observed in Figure 5A and Figure 5B (detailed data points for each phase can be found in Figure S6). The phase diagram for the mixture including POPC showed a substantial increase in the area of the Lβ phase in the composition space, which came at the cost of the Lo+Lβ phase and the three-phase coexistence region, demonstrating an increase in lipid order from the liquid-ordered to the gel phase. Similarly, there was an increase in the size of the Ld+Lβ phase, while the size of the Ld+Lo+Lβ phase was reduced, again indicating an increased gel phase and a decreased liquid-ordered phase. All these effects are expected due to the presence of the more saturated POPC instead of DOPC.
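
Statements about a phase region growing or shrinking can be made quantitative by comparing the fraction of composition points assigned to each phase code in two diagrams. The sketch below is one simple way to do this and is not a procedure described in this study. It assumes both diagrams are labelled on the same uniform composition grid, so that the fraction of points approximates the fraction of area; the label lists and function names are hypothetical.

```python
from collections import Counter

def phase_area_fractions(labels):
    """Fraction of grid points assigned to each phase code."""
    counts = Counter(labels)
    total = len(labels)
    return {code: counts[code] / total for code in sorted(counts)}

def area_change(labels_a, labels_b):
    """Per-phase change in area fraction going from diagram A to diagram B."""
    fa = phase_area_fractions(labels_a)
    fb = phase_area_fractions(labels_b)
    codes = sorted(set(fa) | set(fb))
    return {code: fb.get(code, 0.0) - fa.get(code, 0.0) for code in codes}

# Hypothetical phase codes on the same 10-point grid, for illustration only
diagram_a = [1, 1, 1, 1, 4, 4, 7, 7, 6, 2]
diagram_b = [1, 1, 1, 4, 4, 5, 5, 7, 3, 2]
change = area_change(diagram_a, diagram_b)
```

A positive entry in `change` means the region occupies a larger share of the composition space in the second diagram than in the first.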

Figure 5.

The effect of lipid unsaturation on the lipid phase diagram. A) The phase diagram for the DOPC/DPPC/CHOL mixture from [27] and B) the phase diagram for the POPC/DPPC/CHOL mixture as generated by the RF model. The temperature in both diagrams is 22 °C. Number codes: 1 = Ld, 2 = Lo, 3 = Lβ, 4 = Ld+Lo, 5 = Ld+Lβ, 6 = Lo+Lβ, 7 = Ld+Lo+Lβ, and 8 = Crystals+Lo. Lipid abbreviations: DOPC: 1,2-dioleoyl-sn-glycero-3-phosphocholine, POPC: 1-palmitoyl-2-oleoyl-glycero-3-phosphocholine, DPPC: 1,2-dipalmitoyl-sn-glycero-3-phosphocholine.

To the best of our knowledge, this is the first use of machine learning for the prediction of lipid phase diagrams. While the generated phase diagrams still need to be tested experimentally, the changes in the diagrams generated at different temperatures and lipid compositions using the RF algorithm suggest that the predicted phase diagrams are robust with respect to changes in composition or temperature and could potentially be used to generate diagrams for mixtures or conditions that have not yet been experimentally tested. This would be particularly beneficial because developing phase diagrams with accurate tie-lines requires significant experimental effort.

This first approach to the use of machine learning for the prediction of lipid phase diagrams could certainly be improved with further optimization. In particular, the model can be improved in predicting the Ld+Lo+Lβ phase coexistence region (region 7 in the phase diagrams). This region must be a triangle with three straight sides. However, while the model predicts the boundaries for each phase, the predicted region for this phase is sometimes not an exact triangle. Here, the boundaries of the neighboring regions have been used to identify this region before plotting the other regions (see e.g. Figure S4). However, this approach is somewhat arbitrary and affects the correct prediction of the three-phase region. Future efforts should focus on improving the model to ensure that this phase is represented by a triangle. In addition, the lack of a clear boundary in the transition from the Ld to the Lo region (i.e. region 1 to region 2) creates an issue in the training of the model. Supervised machine learning models require that all classes (in this case phases) are bounded. Thus, an artificial boundary has been assumed here, running horizontally from the plait point of the Ld+Lo phase to the left side of the triangle (see Figure S1). This issue cannot be solved by better optimization because it arises from the fact that not all phase transitions are first order.
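
The requirement that the three-phase region be a triangle can be imposed as a post-processing step on predicted labels. The sketch below is one possible way to do so and is not the procedure used in this study. Given three vertices, it relabels every grid point inside the triangle as region 7 and reports points that carry label 7 outside the triangle, so they can be reassigned separately. Points are written as (x_high, x_chol) pairs; because this is an affine view of the ternary diagram, straight boundaries remain straight. The vertices, points, labels, and function names are all hypothetical.

```python
def _cross(o, a, b):
    """z-component of the cross product (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def in_triangle(p, v1, v2, v3, eps=1e-12):
    """True if p lies inside or on the boundary of triangle v1-v2-v3."""
    d1 = _cross(v1, v2, p)
    d2 = _cross(v2, v3, p)
    d3 = _cross(v3, v1, p)
    has_neg = min(d1, d2, d3) < -eps
    has_pos = max(d1, d2, d3) > eps
    return not (has_neg and has_pos)

def enforce_three_phase_triangle(points, labels, vertices, code=7):
    """Relabel points inside the triangle as `code`.

    Returns the new labels and the indices of points that carry `code`
    but lie outside the triangle (left unchanged here, to be reassigned).
    """
    v1, v2, v3 = vertices
    new_labels, stray = [], []
    for idx, (p, lab) in enumerate(zip(points, labels)):
        if in_triangle(p, v1, v2, v3):
            new_labels.append(code)
        else:
            new_labels.append(lab)
            if lab == code:
                stray.append(idx)
    return new_labels, stray

# Hypothetical vertices and predictions, for illustration only
vertices = [(0.2, 0.1), (0.7, 0.1), (0.5, 0.3)]
points = [(0.45, 0.15), (0.1, 0.05), (0.8, 0.4), (0.45, 0.1)]
labels = [4, 5, 7, 5]
new_labels, stray = enforce_three_phase_triangle(points, labels, vertices)
```

How the vertices are chosen is the arbitrary part noted above; this sketch takes them as given and only enforces the straight sides.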

The model can also be improved once further experimental data are available for better training of the algorithm. While the model for Tm prediction has been trained with the available Tm data, the available experimental data are not balanced. For example, more Tm data are available for lipids with the PC headgroup than for other headgroups. Also, for unsaturated lipids, more data are available for lipids with cis unsaturation than for those with trans unsaturation. The availability of such data will help better train the model, leading to more accurate results. The lack of extensive experimental results also limits the phase diagram model. For example, while it is possible to use the model to generate phase diagrams at any temperature, diagrams for temperatures far outside the range of 18 °C to 28 °C would not be highly reliable, as the training dataset is only available in this temperature range. This problem could be solved once experimental data at such temperatures are available to train the algorithms. A similar issue exists for phase diagrams of lipids with multiple double bonds in one acyl chain because, to the best of our knowledge, no experimental phase diagrams currently exist for mixtures in which one of the lipids contains multiple double bonds. In other words, and as expected, the proposed machine learning algorithm is only as good as its training dataset, and experiments and machine learning algorithms need to go hand-in-hand.
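
The temperature limitation can be made explicit when querying a model by reporting how far a requested temperature lies outside the training range. The sketch below is a minimal illustration. The 18 °C to 28 °C bounds come from the text above, while the function name and the idea of reporting a distance in degrees are assumptions.

```python
TRAIN_T_MIN_C = 18.0  # lower bound of the training temperature range (from the text)
TRAIN_T_MAX_C = 28.0  # upper bound of the training temperature range (from the text)

def extrapolation_distance(temperature_c, t_min=TRAIN_T_MIN_C, t_max=TRAIN_T_MAX_C):
    """Degrees Celsius by which a query lies outside the training range.

    Returns 0.0 for temperatures inside the range.
    """
    if temperature_c < t_min:
        return t_min - temperature_c
    if temperature_c > t_max:
        return temperature_c - t_max
    return 0.0

queries = [15.0, 22.0, 37.0]
distances = [extrapolation_distance(t) for t in queries]
```

A larger distance indicates a prediction that relies more heavily on extrapolation and should be treated with correspondingly more caution.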

In conclusion, the current study is a first attempt at using machine learning for the prediction of lipid melting temperatures and phase diagrams. The algorithms developed in this study accurately predict melting temperatures and phase diagrams for molecules and mixtures for which such data have been reported, and generate results that are in line with theoretical predictions for systems that have not yet been explored. However, the algorithms can be improved by including more features and by further training once more experimental results become available. While experiments will undoubtedly continue to hold an invaluable place in biomembrane research, it might be possible to utilize machine learning algorithms for systems for which data are not yet available.

Supplementary Material

1
2
3
4
5
6
mmc1

Highlights.

  • Machine learning approaches can be used to predict lipid phase behavior.

  • An artificial neural network accurately predicted lipid transition temperatures.

  • A random forest model predicted 3-component phase diagrams of lipid mixtures.

  • Predicted phase diagrams closely matched published results.

  • Effect of changes in temperature and lipid chemistry on phase diagrams is studied.

5. ACKNOWLEDGMENT

AF gratefully acknowledges funding from the NIH (grant R15ES030140). Financial support from the Russ College of Engineering and Technology is also acknowledged.

Footnotes

Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Declaration of competing interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

6. REFERENCES

  • [1] Feigenson GW, Phase diagrams and lipid domains in multicomponent lipid bilayer mixtures, Biochim. Biophys. Acta BBA - Biomembr 1788 (2009) 47–52. 10.1016/j.bbamem.2008.08.014.
  • [2] Elson EL, Fried E, Dolbow JE, Genin GM, Phase Separation in Biological Membranes: Integration of Theory and Experiment, Annu. Rev. Biophys 39 (2010) 207–226. 10.1146/annurev.biophys.093008.131238.
  • [3] Simons K, Toomre D, Lipid rafts and signal transduction, Nat. Rev. Mol. Cell Biol 1 (2000) 31–39. 10.1038/35036052.
  • [4] Farnoud AM, Toledo AM, Konopka JB, Del Poeta M, London E, Raft-like membrane domains in pathogenic microorganisms, in: Curr. Top. Membr, Elsevier, 2015: pp. 233–268. 10.1016/bs.ctm.2015.03.005.
  • [5] Marsh D, Cholesterol-induced fluid membrane domains: A compendium of lipid-raft ternary phase diagrams, Biochim. Biophys. Acta BBA - Biomembr 1788 (2009) 2114–2123. 10.1016/j.bbamem.2009.08.004.
  • [6] Parton RG, Richards AA, Lipid rafts and caveolae as portals for endocytosis: new insights and common mechanisms, Traffic. 4 (2003) 724–738. 10.1034/j.1600-0854.2003.00128.x.
  • [7] Alvarez FJ, Douglas LM, Konopka JB, Sterol-rich plasma membrane domains in fungi, Eukaryot. Cell 6 (2007) 755–763. 10.1128/EC.00008-07.
  • [8] Feigenson GW, Buboltz JT, Ternary phase diagram of dipalmitoyl-PC/dilauroyl-PC/cholesterol: Nanoscopic domain formation driven by cholesterol, Biophys. J. N. Y 80 (2001) 2775–88. 10.1016/S0006-3495(01)76245-5.
  • [9] Konyakhina TM, Wu J, Mastroianni JD, Heberle FA, Feigenson GW, Phase diagram of a 4-component lipid mixture: DSPC/DOPC/POPC/chol, Biochim. Biophys. Acta BBA - Biomembr 1828 (2013) 2204–2214. 10.1016/j.bbamem.2013.05.020.
  • [10] Uppamoochikkal P, Tristram-Nagle S, Nagle JF, Orientation of tie-lines in the phase diagram of DOPC/DPPC/Cholesterol model biomembranes, Langmuir. 26 (2010) 17363–17368. 10.1021/la103024f.
  • [11] Konyakhina TM, Feigenson GW, Phase diagram of a polyunsaturated lipid mixture: Brain sphingomyelin/1-stearoyl-2-docosahexaenoyl-sn-glycero-3-phosphocholine/cholesterol, Biochim. Biophys. Acta BBA - Biomembr 1858 (2016) 153–161. 10.1016/j.bbamem.2015.10.016.
  • [12] Cevc G, How membrane chain-melting phase-transition temperature is affected by the lipid chain asymmetry and degree of unsaturation: an effective chain-length model, Biochemistry. 30 (1991) 7186–7193. 10.1021/bi00243a021.
  • [13] Wang L, Irausquin SJ, Yang JY, Prediction of lipid-interacting amino acid residues from sequence features, Int. J. Comput. Biol. Drug Des 1 (2008) 14–25. 10.1504/ijcbdd.2008.018707.
  • [14] Cho H, Wu M, Bilgin B, Walton SP, Chan C, Latest developments in experimental and computational approaches to characterize protein–lipid interactions, PROTEOMICS. 12 (2012) 3273–3285. 10.1002/pmic.201200255.
  • [15] López CA, Vesselinov VV, Gnanakaran S, Alexandrov BS, Unsupervised Machine Learning for Analysis of Phase Separation in Ternary Lipid Mixture, J. Chem. Theory Comput 15 (2019) 6343–6357. 10.1021/acs.jctc.9b00074.
  • [16] Mitra ED, Whitehead SC, Holowka D, Baird B, Sethna JP, Computation of a Theoretical Membrane Phase Diagram and the Role of Phase in Lipid-Raft-Mediated Protein Organization, J. Phys. Chem. B 122 (2018) 3500–3513. 10.1021/acs.jpcb.7b10695.
  • [17] Benítez JM, Castro JL, Requena I, Are artificial neural networks black boxes?, IEEE Trans. Neural Netw 8 (1997) 1156–1164. 10.1109/72.623216.
  • [18] Bala R, Kumar DD, Classification using ANN: A review, Int. J. Comput. Intell. Res 13 (2017) 1811–1820.
  • [19] Breiman L, Random Forests, Mach. Learn 45 (2001) 5–32. 10.1023/A:1010933404324.
  • [20] Liaw A, Wiener M, Classification and regression by Random Forest, 2 (2002) 18–22.
  • [21] Ahmadi E, Garcia-Arce A, Masel DT, Reich E, Puckey J, Maff R, A metaheuristic-based stacking model for predicting the risk of patient no-show and late cancellation for neurology appointments, IISE Trans. Healthc. Syst. Eng 9 (2019) 272–291. 10.1080/24725579.2019.1649764.
  • [22] Nayak J, Naik B, Behera HS, A comprehensive survey on support vector machine in data mining tasks: Applications & challenges, Int. J. Database Theory Appl 8 (2015) 169–186. 10.14257/ijdta.2015.8.1.18.
  • [23] Chen W-H, Hsu S-H, Shen H-P, Application of SVM and ANN for intrusion detection, Comput. Oper. Res 32 (2005) 2617–2634. 10.1016/j.cor.2004.03.019.
  • [24] Bhadra T, Bandyopadhyay S, Maulik U, Differential evolution based optimization of SVM parameters for meta classifier design, Procedia Technol. 4 (2012) 50–57. 10.1016/j.protcy.2012.05.006.
  • [25] Caffrey M, Lipid thermotropic phase transition database (LIPIDAT). User’s guide, Version 1.0, U.S. Dept. of Commerce, National Institute of Standards and Technology, Standard Reference Data Program, Gaithersburg, MD, 1993.
  • [26] Silvius JR, Thermotropic phase transitions of pure lipids in model membranes and their modifications by membrane proteins, John Wiley & Sons, New York, 1982.
  • [27] Davis JH, Clair JJ, Juhasz J, Phase Equilibria in DOPC/DPPC-d62/Cholesterol Mixtures, Biophys. J 96 (2009) 521–539. 10.1016/j.bpj.2008.09.042.
  • [28] Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP, SMOTE: Synthetic minority over-sampling technique, J. Artif. Intell. Res 16 (2002) 321–357. 10.1613/jair.953.
  • [29] Lahmiri S, A Comparative study of backpropagation algorithms in financial prediction, Int. J. Comput. Sci. Eng. Appl. IJCSEA 1 (2011). 10.5121/ijcsea.2011.1402.
  • [30] Roy RK, A primer on the Taguchi method, 2nd ed, Society of Manufacturing Engineers, Dearborn, MI, 2010.
  • [31] Hsu C-W, Chang C-C, Lin C-J, A practical guide to support vector classification, (2003) 15.
  • [32] Mousavi SM, Hajipour V, Niaki STA, Alikar N, Optimizing multi-item multi-period inventory control system with discounted cash flow and inflation: Two calibrated meta-heuristic algorithms, Appl. Math. Model 37 (2013) 2241–2256. 10.1016/j.apm.2012.05.019.
  • [33] Sokolova M, Lapalme G, A systematic analysis of performance measures for classification tasks, Inf. Process. Manag 45 (2009) 427–437. 10.1016/j.ipm.2009.03.002.
  • [34] Murray SM, O’Brien RA, Mattson KM, Ceccarelli C, Sykora RE, West KN, Davis JH, The fluid-mosaic model, homeoviscous adaptation, and Ionic liquids: Dramatic lowering of the melting point by side-chain unsaturation, Angew. Chem. Int. Ed 49 (2010) 2755–2758. 10.1002/anie.200906169.
  • [35] Cullis PR, De Kruijff B, The polymorphic phase behaviour of phosphatidylethanolamines of natural and synthetic origin. A 31P NMR study, Biochim. Biophys. Acta BBA - Biomembr 513 (1978) 31–42. 10.1016/0005-2736(78)90109-8.
  • [36] Castresana J, Nieva JL, Rivas E, Alonso A, Partial dehydration of phosphatidylethanolamine phosphate groups during hexagonal phase formation, as seen by i.r. spectroscopy, Biochem. J 282 (1992) 467–470.
  • [37] de Almeida RFM, Fedorov A, Prieto M, Sphingomyelin/phosphatidylcholine/cholesterol phase diagram: Boundaries and composition of lipid rafts, Biophys. J. N. Y 85 (2003) 2406–16. 10.1016/S0006-3495(03)74664-5.
  • [38] Maulik PR, Shipley GG, N-palmitoyl sphingomyelin bilayers: structure and interactions with cholesterol and dipalmitoylphosphatidylcholine, Biochemistry. 35 (1996) 8025–8034. 10.1021/bi9528356.
