Abstract

Influenza is a respiratory infection caused by the influenza virus that is prevalent worldwide. One of the most contagious variants of influenza is influenza A virus (IAV), which usually spreads in closed spaces through aerosols. Preventive measures such as novel compounds are needed that can act on viral membranes and provide a safe environment against IAV infection. In this study, we screened compounds with common fragrances that are generally used to mask unpleasant odors but can also exhibit antiviral activity against a strain of IAV. Initially, a set of 188 structurally diverse odorants were collected, and their antiviral activity was measured in vapor phase against the IAV solution. Regression models were built for the prediction of antiviral activity using this set of odorants by taking into account their structural features along with vapor pressure and partition coefficient (n-octanol/water). The models were interpreted using a feature weighting approach and Shapley Additive exPlanations to rationalize the predictions as an additional validation for virtual screening. This model was used to screen odorants from an in-house odorant data set consisting of 2020 odorants, which were later evaluated using in vitro experiments. Out of 11 odorants proposed using the final model, 8 odorants were found to exhibit antiviral activity. The feature interpretation of screened odorants suggested that they contained hydrophilic substructures, such as hydroxyl group, which might contribute to denaturation of proteins on the surface of the virus. These odorants should be explored as a preventive measure in closed spaces to decrease the risk of infections of IAV.
Keywords: influenza A virus, virtual screening, antiviral activity, odorants, regression, feature interpretation
Influenza is an acute respiratory infection caused by the influenza virus. It is also known as “the flu” in common language. It routinely spreads in humans leading to the emergence of seasonal epidemics and global outbreaks, thereby causing substantial morbidity, mortality, and economic losses.1−4 Out of different influenza virus types, namely, A, B, and C, influenza A virus (IAV) is the most contagious variant that leads to millions of infections and hundreds of deaths per year.5,6 A strain of IAV (H1N1/pdm09), which emerged in Mexico and the United States, was responsible for swine-origin H1N1 influenza A epidemic in April 2009.7 The predominant mode of transmission of IAV in households is through aerosol particles expelled during coughing or sneezing.8 These small nuclei droplet particles tend to remain airborne for a long time, leading to high transmissibility and increased risk of infection.8
Odorants or fragrance substances are commonly used in household products, such as perfumes, room fresheners, scented candles etc., to provide a fresh environment and mask unpleasant odors.9−11 Traditionally, these odorant substances have been used in aromatherapy, where inhalation of vapors was used to treat cold and respiratory disorders.12 It has also been known that fragrant substances such as essential oils exhibit antimicrobial and anti-influenza activities in vapor phase.13 The antiviral activity of essential oils can be attributed to their lipophilic nature that leads to disruption of viral membrane or interference with viral envelope proteins involved in host cell attachments.14 Since IAV is usually prevalent as aerosols in closed spaces, the use of odorants that act on viral membranes can help in providing a low-risk antiviral environment.
In the past few decades, computational techniques, such as virtual screening, have provided a significant boost in the drug development process.15,16 Large virtual screening libraries are filtered using different approaches, such as similarity searching, pharmacophore modeling, docking, etc., which reduces the number of potential candidate molecules to be tested, which in turn saves time and resources.17,18 With regard to influenza, shape-focused virtual screening techniques, based on a known lead compound, have been previously used to find novel inhibitors for neuraminidase, an enzyme usually targeted against the influenza virus.19 Machine learning approaches have proven to be of great use in the areas of lead generation, lead optimization, and physicochemical property prediction, where mathematical models are employed to obtain the quantitative relationship between biological activity and the structural parameters.20 Furthermore, several methods of interpretation of machine learning models enable prioritization of important features responsible for predictions that can help in understanding the mechanism of action of a compound exhibiting biological activity.21,22
In this study, virtual screening was performed using interpretable regression models to search for odorants that can act against IAV. One-hundred eighty eight odorants pertaining to commonly used fragrances were gathered initially, and antiviral activity was measured by exposing the dried IAV solution to the vaporized form of these compounds. This data set of odorants was used to make regression models for the prediction of antiviral activity against IAV with the help of different structural representations/physical properties. The best-performing model was selected, and important features were studied using a feature weighting approach and Shapley Additive exPlanations (SHAP) as an additional validation for the model to be used for virtual screening. This model was finally used to screen odorants from an in-house odorant data set consisting of 2020 odorants. Out of 11 proposed odorants from the model, 8 odorants were found to be successful in inhibiting IAV when evaluated experimentally, and their antiviral activity was rationalized using feature interpretation approaches. These odorants can further be studied for their use in closed spaces to prevent infection from IAV.
Materials and Methods
Study Design
The aim of the current study is to screen odorants that can exhibit antiviral activity against IAV. For this purpose, a set of structurally diverse odorants were collected, and their antiviral activity was measured, which formed the initial data set. Regression models were built from the experimental activity data of odorants using structural features and physical properties. Feature interpretations provide an additional validation for the models by helping in rationalizing predictions. The workflow and design of the current study are shown in Figure 1. Each step is explained in detail in the following sections.
Figure 1.
Overview of the study. The workflow and design of the study are shown, which aims at screening and validating odorants that can exhibit antiviral activity against influenza A virus.
Compound Data Sets
A set of 188 structurally diverse odorant compounds (Table S1 and Figure S1) were collected, and their antiviral activity was tested against an enveloped virus, IAV strain A/Puerto Rico/8/34 (H1N1). Partition coefficient of n-octanol/water (calculated log P or clogP) and vapor pressure (VP) were computed with EPI Suite software23 for all of the compounds and were found to have a wide range of values from below zero to 8 (clogP) and zero to more than 1000 (VP), respectively (Figures S2 and S3). The antiviral activity was calculated as survived virus/total virus %, i.e., the lower the number of survived virus, the lower will be the percentage, hence better will be the antiviral activity. If all of the viruses are deactivated by an odorant compound, its antiviral activity will be 0%. The antiviral activity was determined using a focus-forming assay, where Madin-Darby canine kidney (MDCK) cells were infected with IAV and stained with an antibody to count its colonies.24 The total virus and survived virus were counted as the number of colonies before and after the exposure to vaporized odorants for 30 min at 22–23 °C. The details of antiviral activity measurement are given in the next section. The distribution of antiviral activity over a set of 188 odorants is shown in Figure 2. In total, 68 molecules out of 188 showed antiviral activity of less than 10% of which 20 molecules had an antiviral activity close to the measurement limit (<0.001%). This initial data set consisted of compounds with already known activity, such as ethanol and higher alcohols, and isolated compounds from essential oils, showing antiviral ability in vapor phase against influenza A virus.13
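As a small illustration of the activity definition above, the percentage of surviving virus can be computed from the two colony counts. This is only a sketch: the function name and the example counts are hypothetical and are not measurements from the study.

```python
def antiviral_activity_percent(survived_ffu: float, total_ffu: float) -> float:
    """Surviving virus as a percentage of total virus (lower = stronger antiviral effect)."""
    if total_ffu <= 0:
        raise ValueError("total_ffu must be positive")
    return 100.0 * survived_ffu / total_ffu


# Hypothetical colony counts (focus-forming units), for illustration only.
full_inactivation = antiviral_activity_percent(0, 8.3e5)
half_survives = antiviral_activity_percent(4.15e5, 8.3e5)
print(full_inactivation, half_survives)
```

Because a lower percentage means fewer surviving viruses, a value of 0% corresponds to complete inactivation.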
Figure 2.

Distribution of antiviral activity. Shown is the distribution of antiviral activity percentage over a set of 188 odorants.
Cells and Viral Cultures
The cells and viral cultures were maintained, as described previously, by Onishi et al.24 MDCK cells (CCL-34) were procured from American Type Culture Collection (Manassas, VA), followed by their maintenance in minimum essential medium (Invitrogen Corporation, NY) with the addition of 5% (v/v) heat-inactivated fetal bovine serum (Sigma-Aldrich Co. LLC, St. Louis, MO) and 50 μg/ml gentamicin (Invitrogen Corporation). The viral strain, IAV strain A/Puerto Rico/8/34 (H1N1), was cultivated on MDCK cells in serum-free medium (Thermo Fisher Scientific Japan K.K., Kanagawa, Japan) supplemented with acetylated trypsin (2 μg/mL) (Sigma-Aldrich) and gentamicin (50 μg/mL). The viral culture was centrifuged at 13,000g for 2 h, the pellet was resuspended, and the solution was collected in a condensed viral phase.
Measurement of Antiviral Activity in Vapor Phase
For evaluation of antiviral activity, 75 μL of a compound was soaked on a cotton ball (Yamatokojo Co., Ltd., Osaka, Japan) and affixed to the lid of a 15 mL glass vial (Maruemu Corporation, Osaka, Japan), followed by incubation at 22–23 °C for 30 min. The IAV solution (1.5 μL with 8.3 × 10⁵ FFU) was dried on the lid of a cryo vial (Thermo Fisher Scientific Japan K.K.) for 30 min. The lid was then placed into the 15 mL glass vial, exposing the lid to vaporized compounds for 30 min at 22–23 °C. After exposure, the lid was removed and IAV was dissolved in serum-free medium, which was then analyzed for virus titer through the focus-forming assay, as described previously by Onishi et al.24
Molecular Representations
The structural information of odorant compounds was represented by several descriptors. First of all, Morgan fingerprints were used, which includes all substructures until a fixed radius is reached from the center of each atom in the molecule.25 Binary Morgan fingerprints with radius 2 were generated using RDKit software26 for the odorant data set, which were folded to a fixed length of 1024 bits. In addition, count Morgan fingerprints were also generated, which replaced the presence or absence of a substructure with its actual number of occurrences. SMARTS patterns27 corresponding to each bit were stored using a hash value that can be mapped onto the compounds. For control calculations, odorant compounds were also represented by Mol2vec descriptors28 (300) of radius 2 as well as two-dimensional (2D) property descriptors generated from Dragon software29 (1832). Mol2vec is an unsupervised machine learning approach to obtain high-dimensional embeddings of chemical substructures. According to its algorithm, which is based on word2vec, vectors of chemically related substructures are grouped together in a vector space.28
Feature Selection
Feature selection was performed on each category of structural descriptors. First, pairwise correlations were calculated between the independent variables (descriptors), and one feature from each pair with a correlation above a threshold of 0.9 was removed. Next, four models, namely, least absolute shrinkage and selection operator (LASSO),30 ridge regression,31 stepwise LASSO,32 and random forest,33 were used to assess the relationship between each independent variable and the dependent variable (activity). A feature was removed when at least three of the four models found it to be unrelated to the activity. The remaining descriptors, along with VP and clogP as additional variables, were used to train the regression models for the prediction of antiviral activity.
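The two filtering steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in the study: it assumes Pearson correlation, compares its absolute value with the 0.9 threshold, keeps the first feature of each correlated pair, and takes the per-model "unrelated" votes as given. All function names, feature names, and values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def drop_correlated(columns, threshold=0.9):
    """Keep a feature only if |r| <= threshold against every feature kept so far."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


def drop_by_vote(features, unrelated_votes, min_votes=3):
    """Remove features that at least `min_votes` models flagged as unrelated to activity."""
    return [f for f in features if unrelated_votes.get(f, 0) < min_votes]


# Hypothetical toy descriptors: bit_b is a perfect multiple of bit_a.
columns = {"bit_a": [0, 1, 2, 3], "bit_b": [0, 2, 4, 6], "bit_c": [1, 0, 1, 0]}
kept = drop_correlated(columns)
# Hypothetical votes: three of four models found bit_c unrelated to activity.
selected = drop_by_vote(kept, {"bit_c": 3})
print(kept, selected)
```

In practice, the votes would come from the fitted LASSO, ridge, stepwise LASSO, and random forest models; how each model decides that a feature is unrelated is not specified here.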
Model Building and Validation Protocol
The odorant compounds were divided into training (80%) and test (20%) sets, giving 151 training compounds and 37 test compounds. Cross validation was performed on the training set to select the best hyperparameters for each regression model. After the best hyperparameters had been selected, the regression models were trained and used to predict the activities of the test set.
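The split sizes and the cross-validation folds can be reproduced with a short sketch. The rounding rule (test size rounded down), the random seed, and the interleaved fold assignment are assumptions chosen so that 188 compounds yield 151 training and 37 test compounds; the function names are hypothetical.

```python
import random


def train_test_split_indices(n, test_fraction=0.2, seed=0):
    """Shuffle indices 0..n-1 and reserve floor(n * test_fraction) of them for testing."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = int(n * test_fraction)  # assumption: round down, 188 -> 37
    return indices[n_test:], indices[:n_test]


def k_fold_indices(train_indices, k=5):
    """Yield (fit, validation) index lists for k-fold cross validation."""
    folds = [train_indices[i::k] for i in range(k)]
    for i in range(k):
        fit = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield fit, folds[i]


train, test = train_test_split_indices(188)
print(len(train), len(test))
for fit, val in k_fold_indices(train):
    print(len(fit), len(val))
```

Hyperparameters would be chosen from the average validation score over the five folds, and the test indices would be touched only once, for the final evaluation.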
For evaluation of regression models, the coefficient of determination (R2), mean absolute error (MAE), and root-mean-squared error (RMSE) were calculated. R2 gives the proportion of the variance of the label y that is explained by the independent model variables and provides an indication of goodness of fit. MAE is the average magnitude of the residuals, and RMSE is the square root of the mean squared residual, which weights large errors more heavily. Together, these metrics indicate how accurate the predictions are and how far they deviate from the actual values.
The metrics are calculated by the following equations

$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}$$

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$

where yi and ŷi are actual and predicted activity values of odorant compound i, respectively, y̅ is the mean of all measured activity values, and n is the total number of odorant compounds. Five-fold cross validation was performed for each regression model, and the results were averaged over 5 trials.
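The three metrics can be computed directly from their standard definitions. The sketch below uses plain Python and hypothetical activity values; the function name is illustrative.

```python
from math import sqrt


def regression_metrics(y_true, y_pred):
    """Return (R2, MAE, RMSE) for measured and predicted activity values."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)             # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    mae = sum(abs(y - p) for y, p in zip(y_true, y_pred)) / n
    rmse = sqrt(ss_res / n)
    return r2, mae, rmse


# Hypothetical activity percentages, for illustration only.
r2, mae, rmse = regression_metrics([0.0, 50.0, 100.0], [10.0, 50.0, 90.0])
print(r2, mae, rmse)
```

Note that R2 is not bounded below: a model whose predictions are worse than simply predicting the mean activity gives a negative value.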
Regression Methods
Ordinary Least Squares Regression
Ordinary least squares (OLS) regression is a statistical analysis method that estimates the relationship between independent variables (descriptors) and a dependent variable (activity) by minimizing the sum of the squared differences between the actual and predicted values of the dependent variable.34 The model fits a linear function of the descriptors that best approximates all of the individual data points. The regression equation can be written in the following form

$$y = a_1 x_1 + a_2 x_2 + a_3 x_3 + \cdots + a_n x_n + b$$
where y is the target variable, x1,x2,x3,···xn are the features, a1,a2,a3···an are the regression coefficients, and b is the regression constant for intercept.
Partial Least Squares Regression
Partial least squares (PLS) regression is a technique that reduces the predictors to a smaller set of uncorrelated components and performs least squares regression on these components instead of the original data.35 PLS creates orthogonal components using correlations between independent variables and their outputs while keeping most of the variance of independent variables. It is highly useful when predictors are highly collinear or when the number of predictors is more than observations.
Ridge Regression
Linear regression methods work by selecting coefficients for the independent variables that minimize a loss function, i.e., the function that computes the distance between the current output of the model and the expected output. If the coefficients are large, the model tends to overfit the training data set and is unable to perform well on unseen data. The ridge regression (RR) method provides an extension to linear regression, where the loss function is modified to limit the complexity of the model.31 A penalty term proportional to the sum of the squared coefficients is added to the loss function

$$L_{\mathrm{ridge}} = \sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 + \alpha \sum_{j} a_j^2$$
Here, α parameter is to be selected so as to avoid overfitting and underfitting.
LASSO Regression
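LASSO is also a modification of linear regression, where the loss function is modified by penalizing the sum of the absolute values of the model coefficients30

$$L_{\mathrm{LASSO}} = \sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 + \alpha \sum_{j} \left|a_j\right|$$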
LASSO can set a few coefficients to zero unlike ridge regression, and in this way, it helps in variable selection.
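The different behavior of the two penalties is easiest to see with a single centered feature and no intercept, where both minimizers have a closed form. The sketch below assumes the losses Σ(y − a·x)² + α·a² for ridge and Σ(y − a·x)² + α·|a| for LASSO; the function names and data are hypothetical.

```python
def soft_threshold(value, threshold):
    """Shrink `value` toward zero by `threshold`, returning exactly 0.0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def one_feature_coefficients(x, y, alpha):
    """Closed-form OLS, ridge, and LASSO coefficients for one centered feature."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    ols = sxy / sxx
    ridge = sxy / (sxx + alpha)                    # shrinks, but never exactly to zero
    lasso = soft_threshold(sxy, alpha / 2) / sxx   # can become exactly zero
    return ols, ridge, lasso


x = [-1.0, 0.0, 1.0]
y = [-2.0, 0.0, 2.0]
print(one_feature_coefficients(x, y, alpha=2.0))
print(one_feature_coefficients(x, y, alpha=10.0))
```

With the larger penalty, the ridge coefficient is only shrunk, whereas the LASSO coefficient is set to zero, which is why LASSO can act as a variable selection step.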
Elastic Net
Elastic net (EN) can be considered as a convex combination of both ridge and LASSO regression.36 It uses the penalties of both and reduces the impact of different features while not eliminating all of the features.
Gaussian Process Regression
Gaussian process (GP) regression is a nonparametric, Bayesian approach to regression that calculates a probability distribution over all admissible functions that fit the data.37 For GP regression, a Gaussian process prior is specified using a mean function and a covariance kernel function. The posterior is calculated using the training data, and the predictive posterior distribution is computed at the points of interest. Under the GP prior, the function values at the training and test points jointly follow a multivariate Gaussian distribution.
We used a combined kernel function for GP composed of a constant kernel (C), a radial basis function (RBF) kernel, and a white kernel (W).
Random Forest
Random forest (RF) is an ensemble of decision trees, which is based on bootstrap aggregation and feature bagging to reduce the variance of individual trees.33 Training decision trees with distinct compound subsets are generated, and a random subset of features is considered during node splitting for the construction of trees. The final RF prediction is a result of consensus predictions across all decision trees. In the case of RF regression, the average of all predicted values gives the final value.
Support Vector Regression
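Support vector regression (SVR) is a supervised learning algorithm that is used to predict continuous values.38 SVR tries to find the best fit line, i.e., the hyperplane that has the maximum number of points within a threshold value. The threshold value ε is the distance between the hyperplane and the boundary line. For training, labeled examples are mapped into a descriptor space and a function f(x) = ⟨w,x⟩ + b is derived that best predicts the target values for test instance x (w is the normal vector of the hyperplane and b is the bias). Parameters w,b are derived via the following optimization equations

$$\min_{w,\,b,\,\xi,\,\xi^{*}} \; \frac{1}{2}\lVert w \rVert^{2} + C\sum_{i=1}^{n}\left(\xi^{(i)} + \xi^{*(i)}\right)$$

$$\text{subject to}\quad y^{(i)} - \langle w, x^{(i)}\rangle - b \le \varepsilon + \xi^{(i)},\qquad \langle w, x^{(i)}\rangle + b - y^{(i)} \le \varepsilon + \xi^{*(i)},\qquad \xi^{(i)},\,\xi^{*(i)} \ge 0$$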
where {x(i),y(i) | i = 1,···,n} is a set of n training examples x(i) with activity values y(i), the parameters ξ(i) and ξ*(i) are slack variables that permit errors for training instances whose predictions deviate from the actual value by more than ε, and C is the cost or regularization hyperparameter introduced to balance training errors and margin size. Training examples that are predicted with a deviation of ε or more from their actual value lie on or outside the ε-tube around the hyperplane and are called support vectors. The prediction function of linear SVR can be expressed as

$$f(x) = \sum_{i=1}^{n}\left(\alpha^{(i)} - \alpha^{*(i)}\right)\langle x^{(i)}, x\rangle + b$$

where α(i) and α*(i) are Lagrangian multipliers used in the dual optimization problem. If linear regression modeling of training data in a given space is not possible, the scalar product ⟨.,.⟩ is replaced by a kernel function K(.,.), which implicitly projects the data into a higher dimensional space where a linear fit is possible. The prediction function using the kernel trick is written as

$$f(x) = \sum_{i=1}^{n}\left(\alpha^{(i)} - \alpha^{*(i)}\right)K\left(x^{(i)}, x\right) + b$$

The Tanimoto kernel, defined in accordance with the Tanimoto coefficient, is often used in compound potency prediction. It is defined as follows for two compound fingerprints u and v

$$K(u, v) = \frac{\langle u, v\rangle}{\langle u, u\rangle + \langle v, v\rangle - \langle u, v\rangle}$$
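For illustration, the standard Tanimoto kernel, K(u, v) = ⟨u,v⟩ / (⟨u,u⟩ + ⟨v,v⟩ − ⟨u,v⟩), can be written in a few lines and applies to both binary and count fingerprints. The function names and toy vectors are hypothetical, and returning 0.0 for two all-zero vectors is a convention assumed here.

```python
def dot(u, v):
    """Scalar product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def tanimoto_kernel(u, v):
    """Tanimoto kernel for binary or count fingerprint vectors."""
    uv = dot(u, v)
    denominator = dot(u, u) + dot(v, v) - uv
    # Assumption: two all-zero fingerprints are given a similarity of 0.0.
    return uv / denominator if denominator else 0.0


# Hypothetical toy fingerprints, far shorter than the folded fingerprints in the study.
binary_similarity = tanimoto_kernel([1, 1, 0, 1], [1, 0, 0, 1])
count_similarity = tanimoto_kernel([2, 0, 1], [1, 0, 1])
print(binary_similarity, count_similarity)
```

For binary vectors, this reduces to the familiar Tanimoto coefficient (shared bits divided by the bits set in either fingerprint).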
Feature Interpretation Methods
Feature Contributions for the Tanimoto Kernel in SVR
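In the SVR model, a feature weighting method was applied, where different weights were assigned to individual fingerprint features corresponding to the coefficients of the primal optimization problem.21 In the case of a linear kernel, feature weights can be determined directly from the dual problem coefficients and support vectors. For the nonlinear kernel, direct access to feature weights is not possible because an explicit mapping into high-dimensional feature space is not computed. However, the decision function with the Tanimoto kernel can be expressed as the sum of feature contributions and the bias since, for a given test compound x, the denominator of the kernel function is constant for each individual support vector

$$f(x) = \sum_{i}\left(\alpha^{(i)} - \alpha^{*(i)}\right)\frac{\langle x^{(i)}, x\rangle}{\langle x^{(i)}, x^{(i)}\rangle + \langle x, x\rangle - \langle x^{(i)}, x\rangle} + b = \sum_{d} fc(x, d) + b$$

The weight of a single feature d to an individual SVR prediction can be obtained by decomposition of the Tanimoto kernel

$$fc(x, d) = \sum_{i}\left(\alpha^{(i)} - \alpha^{*(i)}\right)\frac{x^{(i)}_{d}\, x_{d}}{\langle x^{(i)}, x^{(i)}\rangle + \langle x, x\rangle - \langle x^{(i)}, x\rangle}$$

where the sums run over the support vectors x(i) with dual coefficients α(i) − α*(i), and xd denotes the value of feature d in the corresponding vector.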
The above equation helps to calculate the feature contribution of structural as well as nonstructural features (real number values), such as clogP and VP in the odorant data set. The contributing structural features can be mapped to the compound structures using the SMARTS patterns assigned to them.
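The decomposition can be checked numerically: because the scalar product in the numerator of the Tanimoto kernel is a sum over features, the per-feature terms add up to the kernel-based prediction minus the bias. The sketch below is an illustration under assumptions, not the code used in the study; `dual_coefs` stands for the differences of the Lagrangian multipliers of each support vector, and all names and numbers are hypothetical.

```python
def dot(u, v):
    """Scalar product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def tanimoto_feature_contributions(x, support_vectors, dual_coefs):
    """Per-feature contributions to a Tanimoto-kernel SVR prediction (bias excluded)."""
    contributions = [0.0] * len(x)
    for sv, coef in zip(support_vectors, dual_coefs):
        # The denominator is fixed for a given (support vector, test compound) pair.
        denominator = dot(sv, sv) + dot(x, x) - dot(sv, x)
        if denominator == 0:
            continue
        for d, (sv_d, x_d) in enumerate(zip(sv, x)):
            contributions[d] += coef * sv_d * x_d / denominator
    return contributions


def svr_predict(x, support_vectors, dual_coefs, bias):
    """Kernel-based prediction, used here only to verify the decomposition."""
    total = bias
    for sv, coef in zip(support_vectors, dual_coefs):
        denominator = dot(sv, sv) + dot(x, x) - dot(sv, x)
        if denominator:
            total += coef * dot(sv, x) / denominator
    return total


# Hypothetical toy model with two support vectors and three features.
support_vectors = [[2, 0, 1], [0, 1, 1]]
dual_coefs = [0.5, -1.0]
bias = 10.0
x = [1, 1, 1]
contributions = tanimoto_feature_contributions(x, support_vectors, dual_coefs)
print(contributions, sum(contributions) + bias, svr_predict(x, support_vectors, dual_coefs, bias))
```

In this unscaled toy example, a feature with value zero in the test compound contributes nothing; once features are scaled before training, an originally absent feature can receive a nonzero contribution.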
SHAP
Shapley Additive exPlanations (SHAP) is a method to explain individual predictions based on game theory.39 The Shapley value concept was originally proposed to estimate the contribution of an individual player in a team toward the outcome of a game, which would help in fair distribution of payouts. Shapley values account for the magnitude of each feature’s contribution along with its direction. Features with positive sign contribute toward increasing the activity value, whereas features with negative sign contribute toward decreasing the activity value (i.e., increasing the antiviral ability in the current study).
In the SHAP algorithm, the explanation of the model is specified as

$$g(z') = \phi_0 + \sum_{j=1}^{M}\phi_j z'_j$$

where g is the local explanation model, z′ ∈ {0,1}^M is the coalition vector with an entry of 1 meaning that a feature is “present” and 0 meaning “absent”, M is the number of features, ϕ0 is the base value (the model output when all features are absent), and ϕj ∈ R is the Shapley value or feature contribution of feature j. The Shapley value of a feature is calculated as its average marginal contribution across all possible orderings (permutations) of the feature set: the features are added to the set one at a time, and the resulting change in model output determines their importance.
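For a handful of features, Shapley values can be computed exactly by enumerating all orderings. The sketch below is a toy illustration, not the SHAP library: it assumes that an "absent" feature is represented by a fixed baseline value, whereas SHAP estimates the effect of absence from background data. The model, the baseline, and all names are hypothetical.

```python
from itertools import permutations


def shapley_values(model, x, baseline):
    """Exact Shapley values by averaging marginal contributions over all feature orderings."""
    n_features = len(x)
    phi = [0.0] * n_features
    orderings = list(permutations(range(n_features)))
    for order in orderings:
        z = list(baseline)          # start with every feature "absent"
        previous = model(z)
        for j in order:
            z[j] = x[j]             # switch feature j to "present"
            current = model(z)
            phi[j] += current - previous
            previous = current
    return [value / len(orderings) for value in phi]


def toy_model(z):
    """Hypothetical activity model with an interaction between features 0 and 2."""
    return 50.0 - 20.0 * z[0] + 5.0 * z[1] + 10.0 * z[0] * z[2]


x = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi, toy_model(baseline) + sum(phi), toy_model(x))
```

The contributions add up to the difference between the prediction and the baseline output, and the interaction term is split equally between the two features involved. A negative value lowers the predicted activity percentage, which corresponds to higher antiviral ability in this study.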
Results and Discussion
Performance of Regression Models
Regression models were built for the prediction of antiviral activity of odorant compounds using Morgan fingerprints, Mol2vec, and Dragon 2D property descriptors each in combination with clogP and VP (see Materials and Methods section). With a feature selection approach, the number of structural features for the odorant data set was reduced to 67 for Morgan fingerprints, 156 for Dragon descriptors, and 26 for Mol2Vec descriptors. Five-fold cross validation was performed, and R2 was calculated separately for the training and test data. The result was averaged over 5 trials, and the largest R2 value for the test set determined the best-performing model.
For binary Morgan fingerprints, which consisted of 69 features (67 + clogP, VP), the best predictive model was found to be the Gaussian process with an R2 value of 0.64, followed by the SVR model using the Tanimoto kernel resulting in an R2 value of 0.61. These results improved further when odorant compounds were represented using count Morgan fingerprints, where the R2 value reached 0.68 both with the Gaussian process and SVR using the Tanimoto kernel. MAE and RMSE values were also low for these two models in comparison to all other regression models as shown in Table 1. When Mol2vec (28 features) and Dragon 2D property descriptors (158 features) were used, the best-performing models were SVR and Gaussian process with R2 values of 0.61 and 0.63, respectively. Regression models were also built without using feature selection approaches for each descriptor set; however, the performance was found to be worse in comparison to feature selection. Thus, the Gaussian process model and SVR using the Tanimoto kernel were the best-performing models using count Morgan fingerprints after feature selection. Y–Y plots for all 188 predictions using the SVR model and Gaussian process are shown in Figure 3a,b, respectively. Out of 20 odorant compounds having 0% activity, 12 compounds were predicted as highly active with an activity range ≤6%.
Table 1. Performance of Regression Models for Count Morgan Fingerprintsᵃ

| | OLS | PLS | RR | LASSO | EN | LSVR | NLSVR | RF | GP |
|---|---|---|---|---|---|---|---|---|---|
| R2 | −7.4 × 10²⁴ | 0.60 | 0.63 | 0.51 | 0.50 | 0.60 | 0.68 | 0.50 | 0.68 |
| MAE | 1.15 × 10¹³ | 16.09 | 15.52 | 18.42 | 18.17 | 16.22 | 14.13 | 16.11 | 14.36 |
| RMSE | 6.67 × 10¹³ | 20.85 | 20.09 | 23.01 | 23.11 | 20.84 | 18.74 | 22.77 | 18.45 |

ᵃ Coefficient of determination (R2), mean absolute error (MAE), and root-mean-square error (RMSE) are listed for each of the nine regression models used for the prediction of antiviral activity of odorant test data.
Figure 3.
Actual vs predicted antiviral activity. Shown is a graph of measured (actual) antiviral activity percentage values against predicted antiviral activity percentage values for 188 odorant data set using (a) nonlinear SVR and (b) Gaussian process model.
Model Diagnostics
To better understand the predictions of antiviral activity of odorant compounds, it is important to interpret the model. One prerequisite for feature interpretation is good prediction performance of the model. In the case of a linear regression model (OLS), feature contributions can be estimated based on the regression coefficients. However, linear regression models did not perform well compared to other methods in the current study as shown in Table 1. Hence, features were studied using two different methods—SVR feature weighting approach and SHAP. For the interpretation of features, we chose SVR with the Tanimoto kernel (since both the best-performing models using count Morgan fingerprints—Gaussian process and SVR with the Tanimoto kernel, showed almost similar performance), where both the feature interpretation methods could be applied. Five trials that were performed during cross validation of SVR yielded R2 values of 0.58, 0.65, 0.69, 0.76, and 0.70. The fourth trial consisted of many accurate or close to accurate predictions, which were further studied for feature interpretations. Predictions of two exemplary odorant compounds with high antiviral ability and two exemplary odorant compounds with low antiviral ability are discussed below.
The features were weighted according to the Tanimoto kernel method (see Materials and Methods section), and the features contributing to the prediction were mapped onto the structure using their SMARTS patterns. Blue is assigned to features that contribute toward decreasing the activity percentage (negative feature weight), and red to those that increase the activity percentage (positive feature weight). The higher the contribution from an atom or a group, the darker the respective color in the visualization. For the SHAP force plot, the same color scheme was used: red is assigned to positive Shapley values and blue to negative Shapley values. The length of the bars corresponds to the contribution of each feature toward the prediction. Along with the structural features, contributions of clogP and VP were also recorded using both methods. Figure 4a shows an odorant compound (cinnamic alcohol) with high antiviral ability (measured activity of 0.16%), which the SVR model predicted closely as 4.77%. The features contributing most to decreasing the activity percentage of this odorant compound were Bit 222, which is a hydroxyl functional group attached to a carbon atom (marked in blue), and clogP, as shown in Figure 4a. It is known that alcohols are capable of destroying the envelope of the virus and denaturing proteins, which might be responsible for lowering the risk of infection. Also, the hydroxyl group is a hydrophilic functional group, and the odorant compound has a low log P, which might lead to easy interaction with the water around the virus sample and better accessibility of the odorant to the virus, thereby reducing the effect of the virus.
There were few features that contributed toward increasing the activity percentage of cinnamic alcohol, such as Bit 356, which is a carbon atom within the benzene ring; however, the feature contribution of these features was not as high as the ones with the negative value. Hence, cinnamic alcohol was predicted to have high antiviral ability. The results of the SVR feature weight method were comparable to SHAP as the highest contributing features were found to be similar. However, there were few missing features using both the methods, which were still contributing toward activity prediction. The reason for the occurrence of missing features is that all of the features are scaled for training and prediction with the model as there is a large difference in the range of count bits and values of clogP and VP. Therefore, features having zero value are also scaled to a real number, thereby showing contributions to the predictions. Also, in the case of SHAP, Shapley values assess the impact of the presence or absence of a feature based on the difference between the actual and mean output of the decision function due to which a missing feature could contribute to the prediction.
Figure 4.
Feature contributions of odorants with high antiviral ability. Two exemplary compounds are shown in panels (a) and (b) that exhibit high antiviral ability and have good predictions with the nonlinear SVR model. The feature weights of each contributing bit are given along with the SHAP force plot. Blue color represents features that contribute toward decreasing the activity percentage (increasing antiviral ability) and red color represents features that contribute toward increasing the activity percentage (decreasing antiviral ability). The length of the bars in the SHAP force plot corresponds to the contribution of each feature toward prediction. Structures of the features are represented in the black box.
Figure 4b shows another exemplary odorant compound (cis-3-hexenyl formate) that had an activity of 0% and almost accurate prediction with an activity of 1.40%. Using both SVR feature weights and SHAP, it was found that the feature weights only consisted of negative contributing features i.e., features were assigned weights toward the direction of decreasing the activity percentage. The most contributing features were found to be Bit 773 and Bit 1004, which are esters with the carbonyl group at the end and an aldehyde, respectively. Aldehydes have an ability to denature proteins and affect proteins like influenza hemagglutinin (HA) and neuraminidase (NA) on the surface of the virus. The most contributing features from the model also show hydrophilic ability along with lower logP and tend to react with water around the virus easily thereby, increasing the antiviral ability of this odorant compound.
In contrast to the above examples, we also studied the feature contributions of odorant compounds with low antiviral ability. Figure 5a shows feature weights and SHAP results of an exemplary odorant compound, geranyl n-butyrate, with a measured activity of 77.08% and a predicted activity of 75.0%. The majority of the contributing features had positive weights, i.e., they increased the activity percentage, as marked in red on the visualization. The feature with the maximum weight was Bit 677, which is an ester with a carbonyl group in the middle, whereas clogP was the most contributing feature according to SHAP, although there was no significant difference between the contributions of Bit 677 and clogP. The two features Bit 677 and Bit 893 (carbonyl group in the middle of the ester), along with the higher log P, tend to make the odorant compound hydrophobic. No features such as an aldehyde or hydroxyl group, which could contribute toward denaturing proteins on the surface of the virus, were present in this compound. Another example of an odorant compound, α-pinene, is shown in Figure 5b, where most of the substructures were assigned positive weights, as depicted in the visualization. This odorant compound had an activity of 68.78%, which was predicted as 65.50% by the SVR model using the Tanimoto kernel. It contained a group of substructures on the aliphatic ring (Bit 859, Bit 528, Bit 268, Bit 926) that contributed strongly toward the increase in activity percentage. The SVR feature interpretation results were nearly identical to those of SHAP, as shown in the force plot (Figure 5b). These substructures on the aliphatic ring are known to be hydrophobic. With no aldehyde or alcohol group in the structure, this odorant compound might find it difficult to interact with the envelope or proteins present on the surface of the virus.
Taken together, important features responsible for the successful predictions of odorant compounds were extracted using feature interpretation methods and rationalized from a chemical perspective.
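The additive attribution idea behind the SHAP force plots can be illustrated with a minimal sketch. It computes exact Shapley values for a single sample by enumerating all feature subsets, which is only feasible for a handful of features; SHAP implementations typically approximate these values instead. The toy linear model, its coefficients, and the three feature names are hypothetical and are not the SVR model or the fingerprint bits of this study.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one sample by enumerating feature subsets."""
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight of a coalition of this size that excludes feature i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                # Coalition members take the sample value; the rest stay at baseline.
                without_i = [x[j] if j in subset else baseline[j] for j in range(n)]
                with_i = list(without_i)
                with_i[i] = x[i]
                phi[i] += weight * (predict(with_i) - predict(without_i))
    return phi


def toy_predict(features):
    """Hypothetical activity model (illustration only, not the paper's model)."""
    hydroxyl_count, ester_count, clogp = features
    return 50.0 - 20.0 * hydroxyl_count + 10.0 * ester_count + 5.0 * clogp


sample = [1, 0, 1.5]    # hypothetical odorant: one hydroxyl, no ester, clogP 1.5
baseline = [0, 0, 3.0]  # hypothetical reference point
phi = shapley_values(toy_predict, sample, baseline)

# Negative values push the predicted activity percentage down (blue in the plots),
# positive values push it up (red).
for name, value in zip(["hydroxyl", "ester", "clogP"], phi):
    print(f"{name}: {value:+.2f}")
```

The contributions sum to the difference between the prediction for the sample and the prediction for the baseline, which is the property that lets a force plot stack them along one axis.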
Figure 5.
Feature contributions of odorants with low antiviral ability. Two exemplary compounds are shown in panels (a) and (b) that exhibit low antiviral ability and have good predictions with the nonlinear SVR model. The feature weights of each contributing bit are given along with the SHAP force plot. Blue color represents features that contribute toward decreasing the activity percentage (increasing antiviral ability), and red color represents features that contribute toward increasing the activity percentage (decreasing antiviral ability). The length of the bars in the SHAP force plot corresponds to the contribution of each feature toward prediction. Structures of the features are represented in the black box.
Virtual Screening of Odorants with Antiviral Ability
For the virtual screening of odorants with antiviral activity, the SVR model with the Tanimoto kernel was used, which was trained with count Morgan fingerprints along with clogP and VP, as explained in the previous sections. Since the activity of the initial data set of 188 odorant compounds had been determined experimentally, the SVR model was built using all of the compounds as the training set to utilize the structural context and experimental information of the whole data set. The coefficient of determination (R² value) for the prediction of the training set using this model was 0.94, with MAE and RMSE values of 3.16 and 8.41, respectively.
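As a rough illustration of the quantities named above, the sketch below implements a Tanimoto kernel for count vectors and the three reported metrics (R², MAE, RMSE) in plain Python. The dot-product form of the kernel is an assumption made here for illustration; the exact kernel definition and descriptor scaling are those given in the methods section. The example vectors and activity values are hypothetical.

```python
from math import sqrt


def tanimoto_kernel(a, b):
    """Tanimoto similarity of two count vectors: <a,b> / (<a,a> + <b,b> - <a,b>)."""
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    return dot / denom if denom else 1.0  # two all-zero vectors are treated as identical


def gram_matrix(vectors):
    """Pairwise kernel matrix, as consumed by SVR solvers that accept precomputed kernels."""
    return [[tanimoto_kernel(u, v) for v in vectors] for u in vectors]


def regression_metrics(measured, predicted):
    """Return (R2, MAE, RMSE) for paired measured and predicted values."""
    n = len(measured)
    residuals = [m - p for m, p in zip(measured, predicted)]
    mae = sum(abs(r) for r in residuals) / n
    ss_res = sum(r * r for r in residuals)
    rmse = sqrt(ss_res / n)
    mean = sum(measured) / n
    ss_tot = sum((m - mean) ** 2 for m in measured)
    return 1.0 - ss_res / ss_tot, mae, rmse


# Hypothetical count fingerprints (three bits) for three compounds.
fingerprints = [[2, 0, 1], [1, 1, 1], [0, 3, 0]]
K = gram_matrix(fingerprints)

# Hypothetical measured and predicted activity percentages.
r2, mae, rmse = regression_metrics([0.0, 50.0, 100.0], [10.0, 50.0, 90.0])
print(f"R2={r2:.2f} MAE={mae:.2f} RMSE={rmse:.2f}")
```

Because these metrics are computed on the same compounds used for fitting, they describe how well the model reproduces the training data rather than how well it generalizes.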
The SVR model was used to screen odorants with antiviral ability from an in-house odorant data set developed by Kao Corporation, which consisted of 2020 compounds. The odorants in this data set were represented in the same way as the training compounds, keeping the same order of count Morgan fingerprint bits along with clogP and VP. The activity of the compounds in this data set was predicted using the SVR model. A total of 11 odorants from the in-house data set had a predicted activity of <1%, as shown in Table 2. As a control calculation, an equally well-performing Gaussian process model was used to screen the in-house odorant data set. The odorants proposed by the SVR model were also among the top 13 odorants selected by the Gaussian process model. Predicted activities of the 11 odorants from both the SVR and Gaussian process models are listed in Table S2. These 11 odorants were then subjected to gas-phase experiments to evaluate their antiviral activity.
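The selection logic described above, a threshold on the predicted activity followed by a consistency check against a second model, can be sketched as follows. The function names, compound IDs, and predicted values are hypothetical placeholders; only the <1% threshold and the top-k comparison come from the text.

```python
def screen(predictions, threshold=1.0):
    """Return compound IDs with predicted activity below threshold, lowest first."""
    hits = [(cid, value) for cid, value in predictions.items() if value < threshold]
    return [cid for cid, _ in sorted(hits, key=lambda item: item[1])]


def all_in_top_k(candidates, control_predictions, k):
    """Check whether every candidate is among the k lowest predictions of a control model."""
    top_k = sorted(control_predictions, key=control_predictions.get)[:k]
    return all(cid in top_k for cid in candidates)


# Hypothetical predicted activity percentages from two models.
svr_predictions = {"A": 0.4, "B": 12.0, "C": -2.1, "D": 55.0}
gp_predictions = {"A": 1.5, "B": 9.0, "C": 0.2, "D": 60.0}

hits = screen(svr_predictions)
consistent = all_in_top_k(hits, gp_predictions, k=2)
print(hits, consistent)
```

A regression model is not constrained to the 0 to 100% range, which is why negative predicted values can occur and are simply ranked as the most active candidates.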
Table 2. Antiviral Activity Evaluation of Screened Odorants Using the SVR Model
| compound no. | IUPAC name | measured activity (%) | predicted activity (%) | difference (%) |
|---|---|---|---|---|
| 1 | 2-allylbicyclo[2.2.2]oct-5-en-2-ol | 44.4 | –4.11 | 48.54 |
| 2 | 1-(2-methoxypropoxy)-2-methylpropan-2-ol | <0.005 | 0.41 | <0.41 |
| 3 | 5-methoxy-2,5-dimethyl-3-hexyn-2-ol | <0.005 | –3.88 | <3.89 |
| 4 | 6-methoxy-2,6-dimethylhept-4-yn-2-ol | <0.005 | –1.62 | <1.62 |
| 5 | 6-methoxy-2,6-dimethylhept-3-yn-2-ol | 0.005 | –0.97 | 0.98 |
| 6 | 6-methoxy-2,6-dimethylhept-3-en-2-ol | 0.6 | 0.78 | 0.17 |
| 7 | 2,2,5-trimethyl-4-hexen-1-ol | <0.4 | –1.024 | <1.42 |
| 8 | 4-(isopentyloxy)-2-methylbutan-2-ol | 21.3 | –0.68 | 21.99 |
| 9 | dimethyl disulfide | 0.3 | 0.99 | 0.70 |
| 10 | γ-hexanolactone | <0.005 | –2.43 | <2.44 |
| 11 | phenoxyethanol | 30.4 | 0.41 | 30.03 |
Screened odorants using the SVR model from the in-house data set that had a predicted activity of <1% are given along with their measured values, predicted values, and the difference between them.
Antiviral Activity Evaluation
The vapor phase antiviral activity of the screened odorants (<1% predicted activity; Table 2) was determined following the aforementioned method (“Cells and Viral Cultures” and “Measurement of Antiviral Activity in Vapor Phase” in Materials and Methods). Out of 11 odorants with a predicted activity lower than 1%, 8 odorants (compounds 2, 3, 4, 5, 6, 7, 9, and 10) were confirmed as correctly predicted, with a difference between predicted and measured values of less than 4%. In contrast, the measured values of 3 odorants (compounds 1, 8, and 11) differed substantially from their predicted values (by 48.54, 21.99, and 30.03%, respectively). To the best of our knowledge, except for phenoxyethanol (compound 11), which has long been used as an antibacterial or sanitizing reagent, almost none of these odorants has been reported to have antiviral activity so far. Feature contributions of the screened odorants were calculated from the generated model, and important features were examined to better understand the predictions.
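The 8-of-11 count can be reproduced from the Table 2 values with a short sketch. One assumption is made for illustration: measured values reported as upper bounds (for example, <0.005) are entered at the bound itself. The 4% tolerance is the one stated above.

```python
# (compound no., measured %, predicted %) transcribed from Table 2.
# Assumption: censored measurements such as "<0.005" are entered at their upper bound.
TABLE_2 = [
    (1, 44.4, -4.11),
    (2, 0.005, 0.41),
    (3, 0.005, -3.88),
    (4, 0.005, -1.62),
    (5, 0.005, -0.97),
    (6, 0.6, 0.78),
    (7, 0.4, -1.024),
    (8, 21.3, -0.68),
    (9, 0.3, 0.99),
    (10, 0.005, -2.43),
    (11, 30.4, 0.41),
]


def confirmed_hits(rows, max_difference=4.0):
    """Compounds whose measured and predicted activities differ by less than max_difference."""
    return [no for no, measured, predicted in rows
            if abs(measured - predicted) < max_difference]


confirmed = confirmed_hits(TABLE_2)
missed = [no for no, _, _ in TABLE_2 if no not in confirmed]
print("confirmed:", confirmed)
print("not confirmed:", missed)
```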
Feature Interpretation of Screened Odorants
Structures of the 11 odorants that were experimentally tested for their antiviral ability are shown in Figure 6, where the contributions of different features from the SVR model are marked according to the scheme described in the previous sections (see Model Diagnostics section). Six compounds inside the black box (Figure 6a) contained the hydroxyl group as one of the most contributing features (marked in blue on the structures) along with clogP. As mentioned earlier, alcohols may destroy the envelope and denature proteins, which could explain the lower infectivity observed. This result was in accordance with our understanding that the hydroxyl group, which is hydrophilic, along with a lower logP may help the odorants interact easily with the surrounding water and thereby gain access to the virus. In addition to the hydroxyl groups, the methyl groups on compounds 5 and 6 also contributed significantly to the predictions of antiviral ability. Furthermore, the ring side chain on compound 10 along with clogP (hydrophilic) were found to be important contributors to the prediction of antiviral ability. These functional groups have not been reported previously as enhancers of antiviral activity; however, they do exist in experimentally verified active compounds and were picked up by our model. Therefore, the substructures found to be important in compounds 5, 6, and 10 need to be investigated in the future for their role in antiviral activity, as they might provide a new guideline for designing effective antiviral odorants. For dimethyl disulfide (compound 9), the prediction of high activity was based on its high VP (3266.4) and not on the structural content of the compound. In this case, a higher VP proved to be advantageous because the experiment is performed in the vapor phase. Out of the 11 odorants chosen by our model and tested experimentally for their antiviral ability, 3 odorants were found to have low antiviral ability. 
Structures of these odorants along with their feature contributions from the model are shown in Figure 6b. Although these compounds contained the hydroxyl group, which was picked up by the model as one of the most contributing features, they did not show antiviral activity in the actual experiments. The model's failure to predict the low antiviral ability of these odorants may be attributed to data sparseness, training on a limited number of compounds, and differing conditions. Thus, these three examples highlight a general limitation of machine learning and regression models: not all predictions can be rationalized when the sample size is small. Overall, the prediction accuracy for the majority of the screened odorants was good, and the predictions could be chemically rationalized in terms of antiviral ability.
Figure 6.
Screened odorants. (a) Screened odorants whose antiviral ability was experimentally validated are depicted along with their ID, clogP, and vapor pressure (VP) value. Odorants, where the hydroxyl group and clogP had maximum contributions, are represented in the black box. VP and the ring side chain had major contributions for compounds 9 and 10, respectively. (b) Screened odorants, where antiviral activity was not experimentally verified, are shown along with their ID, clogP, and VP. Feature contributions are marked on the compounds with blue color representing features that contribute toward decreasing the activity percentage (increasing antiviral ability) and red color representing features that contribute toward increasing the activity percentage (decreasing antiviral ability).
Conclusions
In search of odorants that exhibit antiviral activity, virtual screening was performed using interpretable regression models. Initially, a set of 188 structurally diverse odorants with a wide range of clogP and VP values was collected, and their antiviral activity was measured in the vapor phase. Various regression models were trained on this data set using different feature representations along with clogP and VP. These models were compared based on their performance metrics, and the best-performing model was selected. The support vector regression model using the Tanimoto kernel reached a good prediction accuracy, which supported the interpretation of important features using SHAP and the feature weighting approach. This model, which was trained on the measured antiviral ability of odorants, was used to screen odorants from an in-house data set with unknown antiviral ability. Furthermore, the predicted odorants were validated experimentally for their antiviral ability. Out of 11 odorants proposed by the screening, 8 odorants were found to be good antiviral agents. Feature interpretations provided a basis for the chemical rationalization of the antiviral ability of the odorants. Hydrophilic structural groups, such as the hydroxyl group and the carbonyl group at the end of an ester, were found to be the most important substructures that might help odorants denature proteins present on the surface of the virus as well as interact with the surrounding water for easy access to the virus. In the case of dimethyl disulfide, a higher VP contributed to its prediction and validation as an antiviral agent since the experiments were done in the vapor phase. Functional groups such as carbonyl groups in the middle of an ester and substructures on a fatty ring were found to contribute to decreasing the antiviral ability of odorants. 
However, not all successful and unsuccessful predictions could be rationalized, which is a general limitation of machine learning and regression models. To summarize, the interpretable regression models used in this study predicted the antiviral characteristics of odorants, and the predictions for 8 of the 11 proposed compounds were consistent with experimental validation. The odorants screened through our model have not previously been known for their antiviral activity (except for phenoxyethanol) and may prove to be novel inhibitors of influenza A virus. The odorants proposed by our model can help in providing an environment with a low risk of viral infection.
Acknowledgments
clogP and vapor pressure of all of the odorants were computed using EPI Suite software,23 provided by U.S. Environmental Protection Agency. Morgan fingerprints and feature mapping using SMARTS patterns were performed using RDKit software.26 The authors thank Takuya Mori and Satoshi Oono from Kao Corporation for helpful discussions.
Supporting Information Available
The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acsptsci.2c00193.
List of 188 odorants; Tanimoto similarity matrix of 188 odorants; distribution of clogP; distribution of vapor pressure; and predicted activities from the models (PDF)
The authors declare no competing financial interest.
References
- Tang J. W.; Shetty N.; Lam T. T. Y.; Hon K. L. E. Emerging, Novel, and Known Influenza Virus Infections in Humans. Infect. Dis. Clin. North Am. 2010, 24, 603–617. 10.1016/j.idc.2010.04.001.
- Stiver H. G. The Threat and Prospects for Control of an Influenza Pandemic. Expert Rev. Vaccines 2004, 3, 35–42. 10.1586/14760584.3.1.35.
- Krammer F.; Smith G. J. D.; Fouchier R. A. M.; Peiris M.; Kedzierska K.; Doherty P. C.; Palese P.; Shaw M. L.; Treanor J.; Webster R. G.; García-Sastre A. Influenza. Nat. Rev. Dis. Primers 2018, 4, 3. 10.1038/s41572-018-0002-y.
- Leung N. H. L. Transmissibility and Transmission of Respiratory Viruses. Nat. Rev. Microbiol. 2021, 19, 528–545. 10.1038/s41579-021-00535-6.
- Nicholls H. Pandemic Influenza: The Inside Story. PLoS Biol. 2006, 4, e50. 10.1371/journal.pbio.0040050.
- Xu L.; Jiang W.; Jia H.; Zheng L.; Xing J.; Liu A.; Du G. Discovery of Multitarget-Directed Ligands Against Influenza A Virus From Compound Yizhihao Through a Predictive System for Compound-Protein Interactions. Front. Cell. Infect. Microbiol. 2020, 10, 16. 10.3389/fcimb.2020.00016.
- Smith G. J. D.; Vijaykrishna D.; Bahl J.; Lycett S. J.; Worobey M.; Pybus O. G.; Ma S. K.; Cheung C. L.; Raghwani J.; Bhatt S.; Peiris J. S. M.; Guan Y.; Rambaut A. Origins and Evolutionary Genomics of the 2009 Swine-Origin H1N1 Influenza A Epidemic. Nature 2009, 459, 1122–1125. 10.1038/nature08182.
- Cowling B. J.; Ip D. K. M.; Fang V. J.; Suntarattiwong P.; Olsen S. J.; Levy J.; Uyeki T. M.; Leung G. M.; Malik Peiris J. S.; Chotpitayasunondh T.; Nishiura H.; Mark Simmerman J. Aerosol Transmission Is an Important Mode of Influenza A Virus Spread. Nat. Commun. 2013, 4, 1935. 10.1038/ncomms2922.
- Arctander S. Perfume and Flavor Chemicals (Aroma Chemicals); Allured Publishing: Newark, 1994; Vol. 2.
- McAndrew B. A. Perfumes: Art, Science and Technology, Edited by P. M. Muller; D. Lamparsky, Elsevier Applied Science: London, New York, 1991. Flavour Fragrance J. 1992, 7, 239–240. 10.1002/ffj.2730070414.
- Berger R. G. Scent and Chemistry. The Molecular World of Odors. By Günther Ohloff, Wilhelm Pickenhagen and Philip Kraft. Angew. Chem., Int. Ed. 2012, 51, 3058. 10.1002/anie.201201256.
- Sadlon A. E.; Lamson D. W. Immune-Modifying and Antimicrobial Effects of Eucalyptus Oil and Simple Inhalation Devices. Altern. Med. Rev. 2010, 15, 33–47.
- Vimalanathan S.; Hudson J. Anti-influenza Virus Activity of Essential Oils and Vapors. Am. J. Essent. Oils Nat. Prod. 2014, 2, 47–53.
- Wani A. R.; Yadav K.; Khursheed A.; Rather M. A. An Updated and Comprehensive Review of the Antiviral Potential of Essential Oils and Their Chemical Constituents with Special Focus on Their Mechanism of Action against Various Influenza and Coronaviruses. Microb. Pathog. 2021, 152, 104620. 10.1016/j.micpath.2020.104620.
- McInnes C. Virtual Screening Strategies in Drug Discovery. Curr. Opin. Chem. Biol. 2007, 11, 494–502. 10.1016/j.cbpa.2007.08.033.
- Kirchmair J.; Distinto S.; Liedl K. R.; Markt P.; Rollinger J. M.; Schuster D.; Spitzer G. M.; Wolber G. Development of Anti-Viral Agents Using Molecular Modeling and Virtual Screening Techniques. Infect. Disord. - Drug Targets 2011, 11, 64–93. 10.2174/187152611794407782.
- Murgueitio M. S.; Bermudez M.; Mortier J.; Wolber G. In Silico Virtual Screening Approaches for Anti-Viral Drug Discovery. Drug Discovery Today: Technol. 2012, 9, e219–e225. 10.1016/j.ddtec.2012.07.009.
- Makau J. N.; Watanabe K.; Ishikawa T.; Mizuta S.; Hamada T.; Kobayashi N.; Nishida N. Identification of Small Molecule Inhibitors for Influenza a Virus Using in Silico and in Vitro Approaches. PLoS One 2017, 12, e0173582. 10.1371/journal.pone.0173582.
- Kirchmair J.; Rollinger J. M.; Liedl K. R.; Seidel N.; Krumbholz A.; Schmidtke M. Novel Neuraminidase Inhibitors: Identification, Biological Evaluation and Investigations of the Binding Mode. Future Med. Chem. 2011, 3, 437–450. 10.4155/fmc.10.292.
- Lavecchia A. Machine-Learning Approaches in Drug Discovery: Methods and Applications. Drug Discovery Today 2015, 20, 318–331. 10.1016/j.drudis.2014.10.012.
- Balfer J.; Bajorath J. Visualization and Interpretation of Support Vector Machine Activity Predictions. J. Chem. Inf. Model. 2015, 55, 1136–1147. 10.1021/acs.jcim.5b00175.
- Rodríguez-Pérez R.; Bajorath J. Interpretation of Machine Learning Models Using Shapley Values: Application to Compound Potency and Multi-Target Activity Predictions. J. Comput.-Aided Mol. Des. 2020, 34, 1013–1026. 10.1007/s10822-020-00314-0.
- US EPA: Estimation Programs Interface Suite for Microsoft Windows, v 4.11. https://www.epa.gov/tsca-screening-tools/epi-suitetm-estimation-program-interface (accessed July 20, 2021).
- Onishi S.; Mori T.; Kanbara H.; Habe T.; Ota N.; Kurebayashi Y.; Suzuki T. Green Tea Catechins Adsorbed on the Murine Pharyngeal Mucosa Reduce Influenza A Virus Infection. J. Funct. Foods 2020, 68, 103894. 10.1016/j.jff.2020.103894.
- Rogers D.; Hahn M. Extended-Connectivity Fingerprints. J. Chem. Inf. Model. 2010, 50, 742–754. 10.1021/ci100050t.
- RDKit: Cheminformatics and Machine Learning Software. http://www.rdkit.org (accessed October 23, 2021).
- James C. A.; Weininger D.; Delany J. SMARTS Theory. Daylight Theory Manual; Daylight Chemical Information Systems: Laguna Niguel, CA, 2000.
- Jaeger S.; Fulle S.; Turk S. Mol2vec: Unsupervised Machine Learning Approach with Chemical Intuition. J. Chem. Inf. Model. 2018, 58, 27–35. 10.1021/acs.jcim.7b00616.
- Helguera A.; Combes R.; Gonzalez M.; Cordeiro M. N. Applications of 2D Descriptors in Drug Design: A DRAGON Tale. Curr. Top. Med. Chem. 2008, 8, 1628–1655. 10.2174/156802608786786598.
- Tibshirani R. Regression Shrinkage and Selection Via the Lasso. J. R. Stat. Soc.: Ser. B (Methodol.) 1996, 58, 267–288. 10.1111/j.2517-6161.1996.tb02080.x.
- Hoerl A. E.; Kennard R. W. Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics 2000, 42, 80–86. 10.1080/00401706.2000.10485983.
- Hastie T.; Tibshirani R.; Tibshirani R. Best Subset, Forward Stepwise or Lasso? Analysis and Recommendations Based on Extensive Comparisons. Stat. Sci. 2020, 35, 579–592. 10.1214/19-STS733.
- Breiman L. Random Forests. Mach. Learn. 2001, 45, 5–32. 10.1023/A:1010933404324.
- Zdaniuk B. Ordinary Least-Squares (OLS) Model. In Encyclopedia of Quality of Life and Well-Being Research; Springer Netherlands: Dordrecht, 2014; pp 4515–4517.
- Garthwaite P. H. An Interpretation of Partial Least Squares. J. Am. Stat. Assoc. 1994, 89, 122. 10.1080/01621459.1994.10476452.
- Zou H.; Hastie T. Regularization and Variable Selection via the Elastic Net. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 2005, 67, 301–320. 10.1111/j.1467-9868.2005.00503.x.
- Rasmussen C. E.; Williams C. K. I. Gaussian Processes for Machine Learning; The MIT Press, 2006.
- Drucker H.; Burges C. J. C.; Kaufman L.; Smola A.; Vapnik V. Support Vector Regression Machines. In Advances in Neural Information Processing Systems 9; NIPS, 1996.
- Lundberg S. M.; Lee S. A Unified Approach to Interpreting Model Predictions. In Advances in Neural Information Processing Systems 30; NIPS, 2017.