“intelligent Read Across (iRA)”- A tool for read-across-based toxicity prediction of nanoparticles

Souvik Pore; Kunal Roy

doi:10.1016/j.csbj.2025.07.032

. 2025 Jul 17;29:186–200. doi: 10.1016/j.csbj.2025.07.032

PMCID: PMC12296513 PMID: 40718579

Abstract

The rapid advancement of nanotechnology has enabled the use of nanoparticles (NPs) in various applications, such as medicine, electrochemical sensors, and cosmetics, due to their unique physical and chemical properties. Their small size allows these particles to penetrate biological systems and interact with intracellular components, which may pose significant health risks to humans and other organisms. As a result, assessing the health risks and environmental impacts of NPs has gained considerable attention. Experimental evaluation of NP toxicity is resource-intensive and raises ethical issues; therefore, various computational methods are used for toxicity assessments. In this research, we introduce a Python-based tool called “intelligent Read Across (iRA),” which makes predictions using similarity-based read-across algorithms. In addition to toxicity endpoint predictions, this tool enables pairwise similarity calculations, read-across optimization, and the identification of important features related to read-across predictions. Similarity calculations assess how close compounds are based on their molecular descriptors. The read-across optimization feature helps determine the best hyperparameter values for the similarity measures. Furthermore, feature importance analysis evaluates the relative significance of features involved in read-across prediction. This tool has been validated using three small datasets (≤ 30 samples) containing nanotoxicity data. External validation metrics show improvements over previously reported models across all datasets. These results demonstrate the effectiveness of this similarity-based read-across method. Consequently, the developed tool can be used for accurate prediction of the toxic potential and prioritization of data-poor NPs.

Keywords: Nanoparticle (NP), Read-across, Read-across tool, Cheminformatics, Nanoinformatics

Highlights

• The widespread use of nanoparticles presents substantial risks to human health and other organisms.
• In this study, we introduce a tool called “intelligent Read Across” (iRA) for evaluating nanoparticle toxicity.
• This tool follows the basic methodology of similarity-based read-across approaches to perform predictions.
• The tool was validated using three small datasets of nanotoxicity.
• This tool is also capable of identifying the structural characteristics and properties that contribute to toxicity.

1. Introduction

In recent years, nanoparticles (NPs) have emerged as revolutionary tools across a wide range of fields due to their distinctive physical and chemical properties [1]. NPs are available in various types, with metal oxide nanoparticles (MONPs) representing approximately 80 percent of this category [2]. MONPs are crucial owing to their notable mechanical, electronic, magnetic, catalytic, and photocatalytic characteristics [3]. These nanoparticles have been utilized in numerous industrial and consumer products, including electrochemical sensors [4], environmental engineering solutions [5], [6], cosmetics [7], and medical applications [8], with TiO₂ and ZnO being frequently incorporated into paints and sunscreens [7], [9]. The extensive application of NPs has led to their release into the environment, where they may disperse through multiple pathways. Their distinctive small size enables them to infiltrate biological systems via the penetration of cell membranes, inhalation, and ingestion. Upon interacting with intracellular substances and macromolecules, NPs can exhibit various adverse effects [10]. Moreover, they may induce cytotoxicity within biological systems, thereby presenting significant health risks to humans and ecological systems [11], [12]. Consequently, it is important to evaluate the hazardous potential of NPs to inform the development of safety guidelines.

Traditionally, the risk assessment of NPs has mainly been performed using in vivo and in vitro experimental tests [13]. However, experimental studies can be impractical because of cost and time constraints, and they cannot keep pace with the safety assessment needs of newly developed materials in the rapidly evolving NP market. In this context, computational methods have emerged as an effective alternative to traditional experimental procedures, reducing the number of experiments, time, cost, and resources required [14], [15]. Regulatory bodies worldwide have endorsed the development and implementation of new approach methodologies (NAMs) aimed at evaluating the hazardous potential of NPs and establishing safety guidelines. NAMs encompass a variety of animal-free computational techniques for hazard assessment of NPs, with in silico methods emerging as the most promising starting point [16], [17]. In silico methods can predict the toxicity of NPs before they are developed, thereby keeping the process entirely animal-free and reducing the associated expenses.

Among the various in silico methods, quantitative read-across is one of the most significant approaches utilized for the toxicological assessment of nanoparticles (NPs) [18]. Read-across is a similarity-driven, non-statistical technique used to fill data gaps. This method leverages information from a source chemical to predict the endpoint of a target chemical that shares some relevant characteristics [19]. Read-across is particularly advantageous in cases where small datasets are involved, as alternative methods, such as quantitative structure-activity relationships (QSAR), may yield unreliable results due to limited degrees of freedom [20]. This method has gained considerable popularity since its introduction, particularly through regulations such as the Registration, Evaluation, Authorisation and Restriction of Chemicals (REACH), for use in regulatory decision-making. Read-across employs one of four distinct strategies to predict endpoints for target chemicals: one-to-one (one source chemical is used to predict the endpoint of one target chemical), one-to-many (one source chemical is used to predict the endpoints of two or more target chemicals), many-to-one (two or more source chemicals are used to predict the endpoint of one target chemical), and many-to-many (two or more source chemicals are used to predict the endpoints of two or more target chemicals). Predictions based on multiple source chemicals are generally more accurate and reliable than those based on a single source chemical [21]. Typically, read-across predictions are derived through either the analogue approach or the category approach.
The analogue approach uses data from one or more similar chemicals (analogues) to predict a single target chemical, while the category approach evaluates a group of related chemicals, using shared properties or common patterns in one or more physico-chemical, toxicological, and/or ecotoxicological properties to fill data gaps across the group [22]. The read-across method also has limitations, the two main ones being the prediction of an absence of toxicity and uncertainty. Reading across an absence of toxicity places a greater burden on justifying the prediction. Uncertainty concerns the extent to which the results can be deemed reliable. The reliability of read-across predictions depends on various factors, including the quality and quantity of experimental data, the similarity measures employed, knowledge of how chemicals interact with biological systems, and additional data from in vitro assays. Although this information is not always available, it can strengthen a weight of evidence (WoE) assessment [23], [24].

A critical step in the read-across prediction process is identifying the source compounds. The search for source chemicals can be conducted based on similarity, which may involve molecular structure or other common characteristics, such as shared precursors or metabolic products, or a consistent pattern in potency variation across groups. These criteria can be applied individually or integrated to enhance the grouping process. In instances where the aforementioned criteria are inapplicable, a similarity approach may be employed for grouping, wherein pair-wise similarity is calculated to identify potential source compounds [24], [25].

A variety of tools are available for read-across prediction, including the OECD QSAR Toolbox [26], ToxRead [23], ToxMatch [27], AIM [28], Apellis [29], CBRA [30], CIIPro [31], GenRA [32], and AMBIT [33]. These tools share several common methodologies for deriving predictions. Most of them accept input formats such as SMILES, CAS IDs, and molecular structure, while only a few accommodate descriptor input. Generally, these tools provide predictions for a limited set of predefined endpoints, including mutagenicity, bioconcentration factor, and repeated dose toxicity. However, certain tools, notably ToxMatch, OECD QSAR Toolbox, CBRA, Apellis, and CIIPro, offer predictions based on user-specified endpoints.

In this study, we present a newly developed tool known as “intelligent Read Across” (iRA), which is designed to undertake various tasks associated with read-across methodologies. This tool adheres to the fundamental principles described by Roy et al. for quantitative read-across predictions [19], [34]. Our work extends their framework by incorporating a range of similarity methods applicable to the read-across prediction of both continuous and categorical endpoints. Additionally, we have introduced an auto-selection algorithm that identifies source compounds for specific target compounds based on various error-based objective functions. Furthermore, we have developed a novel algorithm to determine the relative importance of features involved in generating the read-across predictions. The tool provides a user-friendly graphical interface that enables users to perform various functions, including similarity calculation, read-across prediction, read-across optimization, and assessment of feature importance in read-across applications. The proposed methodology has been validated using three high-quality nano-toxicity datasets obtained from the literature. However, our methodology can also be applied to other types of chemicals, such as organic molecules and peptides.

2. Methods

In the current study, we present a newly developed tool called “intelligent Read Across” (iRA) designed to execute four distinct tasks: similarity calculation, read-across prediction, read-across optimization, and calculation of read-across feature importance. The similarity calculation involves computing the pairwise similarity between two data matrices, which encompass structural information represented through descriptors. Read-across prediction is conducted based on the similarity weight assigned to the endpoint values of source compounds, accommodating both continuous and categorical endpoint types. Additionally, we have implemented a novel strategy for selecting source compounds. The read-across optimization aims to identify the optimal values of hyperparameters associated with various similarity methods, as well as the number of source compounds and the most appropriate similarity measures. Furthermore, we introduce an innovative methodology for assessing the relative importance of features involved in read-across prediction. A detailed explanation of these four tasks is provided as an integral component of the read-across framework.

2.1. Read-Across framework

The read-across approach is a similarity-based method that utilizes the similarity between source (training set) and target (test set) compounds to make predictions. The read-across prediction involves two major steps: (i) identifying the close source/training compounds (CTC), and (ii) performing a similarity-derived weighted average prediction. The identification of close source compounds is achieved by calculating the pairwise similarity between source and target compounds. Then, the similarity weighting of the endpoint values of the close source compounds is used to generate the read-across prediction.

2.1.1. Similarity calculation

The first step of the read-across involves finding the close source compounds for each target compound set, which is achieved by calculating the similarity between them. The similarity between two chemicals can be assessed in multiple ways, such as chemical vs. molecular similarity, 2D vs. 3D similarity, molecular vs. biological similarity, and global vs. local similarity. Although the terms "chemical" and "molecular" similarity are often used interchangeably, this is not entirely accurate. Chemical similarity is primarily computed using the physicochemical properties of a compound (MW, LogP, solubility, etc.), whereas molecular similarity between two chemicals is determined based on their structural features (ring system, topological distance, substructure frequency, etc.) [35]. Similarity can also be assessed based on molecular representation (2D or 3D). The 2D similarity is derived from the molecular graph, where similarity is evaluated by comparing the molecular graphs of two chemicals [36]. Currently, 2D similarity is determined by a set of descriptors that encode the graph information, such as fingerprints, topological distances, or substructure frequency. The 3D similarity between chemicals is assessed by comparing the conformations and associated properties of the molecules [37]. Biological similarity between chemicals reflects their similarity in activity against a particular target. This similarity is evaluated in the target space rather than the chemical space and is more challenging to implement compared to structure-based similarity, as the specific activity values for a chemical may not always be available [38]. Global similarity between two chemicals is assessed by comparing the chemicals entirely through calculated structural descriptors. In contrast, local similarity is evaluated by comparing specific functional groups, atoms, or functionalities, as done in pharmacophore modeling [35]. 
Here, the developed tool is designed to compute the pairwise 2D molecular similarity based on the calculated structural information and physicochemical properties.

In this tool, we have implemented twenty-seven different types of similarity measures, which can be calculated independently or as a part of the read-across algorithm. For the read-across method, similarity values are generated using standardized descriptor values, where standardization is performed using the mean and standard deviation of the training (source compound) set, as shown in Eq. 1.

$$X^{'} = \frac{X - μ}{σ} \tag{1}$$

Here, $X$ and $X^{'}$ represent the non-standardized and standardized descriptor value, respectively, while $μ$ and $σ$ are the mean and standard deviation of that descriptor.
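
The standardization in Eq. 1 can be illustrated with a minimal sketch. The function names `fit_scaler` and `standardize` are hypothetical and are not part of the iRA tool; the use of the sample standard deviation (n - 1 denominator) is also an assumption, since the text does not state which estimator is used. The key point shown is that the mean and standard deviation come from the source set only and are then reused for the target set.

```python
from statistics import mean, stdev


def fit_scaler(source_rows):
    """Return one (mean, std) pair per descriptor, computed from the source set only."""
    columns = list(zip(*source_rows))
    # Assumption: sample standard deviation; a constant descriptor (std = 0) is not handled.
    return [(mean(col), stdev(col)) for col in columns]


def standardize(rows, scaler):
    """Apply Eq. 1 to every descriptor value using the source-set statistics."""
    return [[(x - mu) / sigma for x, (mu, sigma) in zip(row, scaler)] for row in rows]


# Toy descriptor matrices: rows are compounds, columns are descriptors.
source = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
target = [[2.0, 40.0]]

scaler = fit_scaler(source)
source_std = standardize(source, scaler)
target_std = standardize(target, scaler)  # scaled with the source mean and std
```

Reusing the source-set statistics keeps source and target compounds on the same scale, so a target value far outside the source range (the second descriptor here) remains visibly distant after scaling.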

The similarity measures used in this tool can be classified into two major groups, based on the variable type: (i) similarity measures for continuous variables, and (ii) similarity measures for fingerprint-based variables. These similarity measures are represented mathematically in Table 1. As shown in this table, various distance-based measures, which indicate the degree of dissimilarity between molecules, are also employed in the read-across algorithm. These dissimilarity measures are converted into similarity measures using Eqs. 2 and 3. Eq. 2 is used when the distance measure takes values in the range of 0–1 (e.g., Soergel distance). Eq. 3 is used when the distance measure takes values in the range of 0 to infinity (e.g., Euclidean distance). According to these equations, the distance between two identical compounds is zero, resulting in a similarity of 1. As the distance increases, the similarity decreases, reaching zero at a distance of 1 under Eq. 2 and approaching zero under Eq. 3 [39], [40].

$$\text{Similarity} = 1 - \text{Distance} \tag{2}$$

$$\text{Similarity} = \frac{1}{1 + \text{Distance}} \tag{3}$$
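
As a small illustration, the two conversions can be written as follows. The function names are hypothetical, and the range checks are an added safeguard rather than documented behavior of the tool.

```python
def similarity_from_bounded_distance(distance):
    """Eq. 2: for distances confined to the range 0 to 1 (e.g., Soergel distance)."""
    if not 0.0 <= distance <= 1.0:
        raise ValueError("Eq. 2 expects a distance between 0 and 1")
    return 1.0 - distance


def similarity_from_unbounded_distance(distance):
    """Eq. 3: for distances ranging from 0 to infinity (e.g., Euclidean distance)."""
    if distance < 0.0:
        raise ValueError("a distance cannot be negative")
    return 1.0 / (1.0 + distance)


identical = similarity_from_unbounded_distance(0.0)  # identical compounds
far_apart = similarity_from_unbounded_distance(3.0)  # larger distance, lower similarity
bounded = similarity_from_bounded_distance(0.25)
```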

Table 1.

Similarity measures used in this work.

Similarity measure for continuous variables^a	Formula	Similarity measure for fingerprint-based variables^b [41], [42], [43], [44]	Formula
Euclidean Distance [45]	$D_{AB} = \sqrt{\sum_{i = 1}^{n} {(x_{i} - y_{i})}^{2}}$	Tanimoto/Jaccard Coefficient	$S_{AB} = \frac{c}{a + b - c}$
SEuclidean Distance [43]	$D_{AB} = \sqrt{\sum_{i = 1}^{n} \frac{{(x_{i} - y_{i})}^{2}}{V_{i}}}$; V_i = variance of the i^th descriptor of the training set	Dice Coefficient/Hodgkin Index	$S_{AB} = \frac{2 c}{a + b}$
Manhattan Distance [46]	$D_{AB} = \sum_{i = 1}^{n} |x_{i} - y_{i}|$	Euclidean Distance	$D_{AB} = \sqrt{a + b - 2 c}$
Chebyshev Distance [47]	$D_{AB} = \max_{i} (|x_{i} - y_{i}|)$	Manhattan/Hamming/City Block Distance	$D_{AB} = a + b - 2 c$
Minkowski Distance [48]	$D_{AB} = {(\sum_{i = 1}^{n} {|x_{i} - y_{i}|}^{p})}^{1 / p}$	Cosine Similarity/Carbo Index	$S_{AB} = \frac{c}{\sqrt{ab}}$
Mahalanobis Distance [49], [50]	$D_{AB} = \sqrt{{(X - Y)}^{T} S^{- 1} (X - Y)}$; X and Y are input vectors, and S is the sample covariance matrix	Russell-Rao Coefficient	$S_{AB} = \frac{c}{m}$
Canberra Distance [51]	$D_{AB} = \sum_{i = 1}^{n} \frac{|x_{i} - y_{i}|}{|x_{i}| + |y_{i}|}$	Forbes Coefficient	$S_{AB} = \frac{cm}{ab}$
Bray Curtis Distance [52]	$D_{AB} = \frac{\sum_{i = 1}^{n} |x_{i} - y_{i}|}{\sum_{i = 1}^{n} (x_{i} + y_{i})}$	Soergel Distance	$D_{AB} = \frac{a + b - 2 c}{a + b - c}$
Linear Kernel [43]	$S_{AB} = \sum_{i = 1}^{n} x_{i} y_{i}$	Matching Distance	$D_{AB} = \frac{a + b - 2 c}{m}$
Gaussian Kernel [45]	$S_{AB} = e^{- \frac{\sum_{i = 1}^{n} {(x_{i} - y_{i})}^{2}}{2 σ^{2}}}$	Kulsinski Distance	$D_{AB} = \frac{a + b - 3 c + m}{a + b - 2 c + m}$
Laplacian Kernel [53]	$S_{AB} = e^{- γ \sum_{i = 1}^{n} |x_{i} - y_{i}|}$	Rogers Tanimoto Coefficient/Sokal Michener Distance	$D_{AB} = \frac{2 \times (a + b - 2 c)}{m + (a + b - 2 c)}$
Polynomial Kernel [43], [54], [55]	$S_{AB} = {(γ \sum_{i = 1}^{n} x_{i} y_{i} + C_{0})}^{d}$; C₀ = intercept, d = kernel degree	Sokal Sneath Distance	$D_{AB} = \frac{a + b - 2 c}{a + b - 1.5 c}$
Sigmoid Kernel [56]	$S_{AB} = \tanh (γ \sum_{i = 1}^{n} x_{i} y_{i} + C_{0})$
Chi-square Kernel [43]	$S_{AB} = e^{- γ \sum_{i = 1}^{n} \frac{{(x_{i} - y_{i})}^{2}}{x_{i} + y_{i}}}$
Cosine Similarity [57]	$S_{AB} = \frac{\sum_{i = 1}^{n} x_{i} y_{i}}{\sqrt{\sum_{i = 1}^{n} {(x_{i})}^{2} \sum_{i = 1}^{n} {(y_{i})}^{2}}}$

Here, the mathematical equations are written for a comparison of two molecules (A and B). S indicates similarity, and D indicates distance.

^a x_i and y_i represent the values of the i^th descriptor for molecules A and B, respectively, n is the number of descriptors, and p is the order of the Minkowski distance. Here, γ and σ are the hyperparameters of various kernel-based similarity measures.

^b Here, a represents the number of bits set to 1 in molecule A, b represents the number of bits set to 1 in molecule B, c indicates the number of bits set to 1 in both A and B, and m represents the total number of bits.
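
To make the notation of Table 1 concrete, the sketch below computes a few of the listed measures in plain Python. It is illustrative only: the function names are hypothetical, fingerprints are assumed to be equal-length lists of 0/1 values, and no claim is made about how the iRA tool implements these measures internally.

```python
import math


def bit_counts(fp_a, fp_b):
    """Return (a, b, c, m) as defined in footnote b of Table 1."""
    a = sum(fp_a)                                      # bits set in A
    b = sum(fp_b)                                      # bits set in B
    c = sum(1 for x, y in zip(fp_a, fp_b) if x and y)  # bits set in both
    m = len(fp_a)                                      # total number of bits
    return a, b, c, m


def tanimoto(fp_a, fp_b):
    a, b, c, _ = bit_counts(fp_a, fp_b)
    return c / (a + b - c)  # raises ZeroDivisionError if both fingerprints are empty


def dice(fp_a, fp_b):
    a, b, c, _ = bit_counts(fp_a, fp_b)
    return 2 * c / (a + b)


def soergel_distance(fp_a, fp_b):
    a, b, c, _ = bit_counts(fp_a, fp_b)
    return (a + b - 2 * c) / (a + b - c)


def euclidean_distance(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel with hyperparameter sigma (the default value is an assumption)."""
    return math.exp(-euclidean_distance(x, y) ** 2 / (2 * sigma ** 2))


fp_a = [1, 1, 0, 1, 0, 0, 1, 0]
fp_b = [1, 0, 0, 1, 1, 0, 1, 0]
```

For these two fingerprints, a = 4, b = 4, c = 3, and m = 8. The Soergel distance is the complement of the Tanimoto coefficient, which is why Eq. 2 is the natural conversion for it.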

2.1.2. Read-Across predictions

As mentioned above, read-across is a similarity-based approach that predicts the response of the target compound by considering the similarity weights of the response variables of closely related source compounds. Our developed tool is designed to make read-across predictions for both continuous and categorical types of response variables. The steps involved in the read-across prediction algorithm are discussed below -

a. Pair-wise similarity calculation: The descriptor-based molecular similarity between each target compound and each source compound is calculated using the above-mentioned similarity measures (Table 1).
b. Selection of the close source compounds: For each target compound, the specified number of source compounds with the highest similarity to that target compound is selected.
c. Calculation of the weightage: The similarity weightage is calculated for each selected close source compound by dividing its individual similarity value by the total similarity between the target compound and all source compounds (Eq. 4).

$$W_{ij} = \frac{S_{ij}}{\sum_{m = 1}^{n} S_{im}}, \quad m = 1, 2, 3, \dots, j, \dots, n \tag{4}$$

Here, $S_{ij}$ and $W_{ij}$ are the similarity and calculated weightage between the i^th target compound and the j^th source compound, respectively, and n is the total number of source compounds.

d. Weighted Average Prediction (WAP): The final prediction (or weighted average prediction) for each target compound is calculated from the selected close source compounds using the following equation:

$${WAP}_{i} = \frac{\sum_{j = 1}^{CTC} W_{ij} \times Y_{j}}{\sum_{j = 1}^{CTC} W_{ij}} \tag{5}$$

Here, ${WAP}_{i} =$ weighted average prediction for the i^th target compound, $W_{ij} =$ calculated weightage between the i^th target compound and the j^th source compound, $Y_{j} =$ response value of the j^th source compound, and CTC = number of the selected close source/training compounds.

For the categorical response variable, the predicted values are categorized based on the specified classification threshold.

The complete algorithm of the read-across prediction is also represented in Fig. 1.
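
The prediction steps described above (similarity calculation, selection of close source compounds, weighting by Eq. 4, and the weighted average of Eq. 5) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the tool's implementation: the function names are hypothetical, Euclidean distance converted with Eq. 3 stands in for any measure from Table 1, and the rule `prediction >= threshold` for categorical endpoints is an assumption about how the classification threshold is applied.

```python
import math


def euclidean_similarity(x, y):
    """Euclidean distance converted to a similarity with Eq. 3."""
    distance = math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))
    return 1.0 / (1.0 + distance)


def read_across_predict(target, sources, responses, n_ctc=3):
    # Step a: pairwise similarity between the target and every source compound.
    sims = [euclidean_similarity(target, s) for s in sources]
    # Step b: indices of the n_ctc most similar source compounds.
    close = sorted(range(len(sources)), key=lambda j: sims[j], reverse=True)[:n_ctc]
    # Step c: Eq. 4, normalizing by the total similarity to all source compounds.
    total = sum(sims)
    weights = {j: sims[j] / total for j in close}
    # Step d: Eq. 5, weighted average over the selected close source compounds.
    wap = sum(weights[j] * responses[j] for j in close) / sum(weights[j] for j in close)
    return wap, close


def classify(prediction, threshold):
    """Assumed rule for categorical endpoints: class 1 at or above the threshold."""
    return 1 if prediction >= threshold else 0


sources = [[0.0], [1.0], [3.0], [10.0]]  # standardized descriptors (toy data)
responses = [1.0, 2.0, 4.0, 9.0]
prediction, close = read_across_predict([0.0], sources, responses, n_ctc=2)
```

In this toy case, the two closest source compounds have similarities 1 and 0.5, so the prediction is (1 × 1 + 0.5 × 2) / 1.5 = 4/3. The normalization constant of Eq. 4 cancels in Eq. 5, so only the relative similarities of the selected compounds affect the result.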

In this study, we also introduce a new method for the automatic selection of the close source compounds in a target-specific manner. Previously, a single specified number of CTCs was used to predict all the target compounds: if the CTC value was set at 8, the eight most similar source compounds were used for the prediction of every target compound [19], [34]. With this new approach, the CTC value is not fixed and can vary across the target compounds. In this method, the WAP is calculated at various CTC values for each target compound, and an error-based objective function (Eq. 6 or Eq. 7) is then computed between the WAP and the response values of the close source compounds used for that prediction. Subsequently, the CTC value that minimizes the objective function is selected for each target compound.

$$\text{objective function 1} = \frac{1}{CTC} \sum_{i = 1}^{CTC} |WAP - Y_{i}| \tag{6}$$

$$\text{objective function 2} = \frac{1}{CTC} \sum_{i = 1}^{CTC} {(WAP - Y_{i})}^{2} \tag{7}$$

This search can become computationally expensive if the source (training) set contains a large number of compounds, so the upper limit of the CTC is currently set at 10.

The whole process of auto CTC selection is also shown in Fig. 2.
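
A minimal sketch of the automatic CTC selection is given below. Several details are assumptions rather than documented behavior: the function name `auto_ctc` is hypothetical, ties are resolved in favor of the smaller CTC, and the search starts at `min_ctc=2`. The last choice is made because with a single close source compound the WAP equals that compound's response, so Eqs. 6 and 7 are trivially zero; the lower bound actually used by the tool is not stated in the text.

```python
def auto_ctc(sims, responses, min_ctc=2, max_ctc=10, squared=False):
    """Pick the CTC that minimizes Eq. 6 (squared=False) or Eq. 7 (squared=True).

    sims: similarities between one target compound and every source compound.
    responses: response values of the source compounds, in the same order.
    """
    order = sorted(range(len(sims)), key=lambda j: sims[j], reverse=True)
    best = None
    for ctc in range(min_ctc, min(max_ctc, len(order)) + 1):
        close = order[:ctc]
        # Eq. 5; the normalization of Eq. 4 cancels, so raw similarities suffice.
        wap = sum(sims[j] * responses[j] for j in close) / sum(sims[j] for j in close)
        if squared:
            objective = sum((wap - responses[j]) ** 2 for j in close) / ctc
        else:
            objective = sum(abs(wap - responses[j]) for j in close) / ctc
        if best is None or objective < best[0]:
            best = (objective, ctc, wap)
    if best is None:
        raise ValueError("not enough source compounds for the requested CTC range")
    return best[1], best[2]


sims = [0.9, 0.8, 0.5, 0.4]
responses = [2.0, 2.2, 5.0, 1.0]
ctc, wap = auto_ctc(sims, responses)
```

Here the two most similar source compounds have nearly identical responses, so adding the third and fourth compounds increases the spread around the WAP and the search settles on two close source compounds.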

If the response values of the target compounds are available, this tool can validate the predictions obtained from the read-across algorithm. For the validation, different external validation metrics are used, based on the type of response variable (continuous or categorical). These external validation metrics are given in Equations S1 to S13 of Supplementary Material 1 (SM1).

This tool also generates eighteen different similarity and error measures for each target compound based on the close source compounds [34]. The mathematical formula and description of each measure are presented in Table 2.

Table 2.

Read-across derived similarity and error measures.

Read-Across measures	Mathematical formula	Description
n_effective (n_eff)	$n_{eff} = \frac{{(\sum_{i = 1}^{n} w_{i})}^{2}}{\sum_{i = 1}^{n} ({w_{i}}^{2})}$; w_i = weightage of the i^th source compound for a target compound, n = number of close source compounds.	It represents the effective sample size, describing the actual number of independent data points in a dataset when the data points are not all equally weighted.
SD_Activity (S_weighted)	$S_{weighted} = \sqrt{\frac{\sum_{i = 1}^{n} w_{i} {(x_{i} - \bar{x}_{wtd})}^{2}}{\sum_{i = 1}^{n} w_{i}} \times \frac{n_{eff}}{n_{eff} - 1}}$; $\bar{x}_{wtd} = \frac{\sum_{i = 1}^{n} x_{i} w_{i}}{\sum_{i = 1}^{n} w_{i}}$; x_i = response value of the i^th source compound for a target compound	It represents the weighted standard deviation of the observed response value of the selected close source compounds for each target compound.
CV_Activity	$CV_{activity} = \frac{S_{weighted}}{\bar{x}_{wtd}}$	Coefficient of variation of the observed response value of the selected close source compounds for each target compound.
Standard_Error (SE)	$SE = \frac{S_{weighted}}{\sqrt{n_{eff}}}$	Standard error of the observed response of the selected close source compounds. It indicates the expected variation of a sample statistic from the true population parameter.
Avg_similarity	$Avg_{similarity} = \frac{\sum_{i = 1}^{n} S_{i}}{n}$; S_i = similarity between the i^th source compound and a target compound	The average of the similarity values between each target compound and selected source compounds.
SD_similarity	$SD_{similarity} = \sqrt{\frac{\sum_{i = 1}^{n} {(S_{i} - Avg_{similarity})}^{2}}{n - 1}}$	The standard deviation of the similarity values between each target compound and selected source compounds.
CV_similarity	$CV_{similarity} = \frac{SD_{similarity}}{Avg_{similarity}}$	The coefficient of variation of the similarity values between each target compound and selected source compounds.
MaxPos	-	The maximum similarity value of a positive close source compound for each target compound (The selected close source compounds are categorized into two classes – (i) positive, which have a response value greater than or equal to the training set mean, and (ii) negative, which have a response value less than the training set mean).
PosAvgSim	-	The average of the similarity of the positive close source compounds.
MaxNeg	-	The maximum similarity value of a negative close source compound for each target compound.
NegAvgSim	-	The average of the similarity of the negative close source compounds.
AbsDiff	$AbsDiff = |MaxPos - MaxNeg|$	The absolute difference between MaxPos and MaxNeg.
g	$g = 1 - 2 \times |PosFrac - 1 / 2|$	The concordance measure.
g_m	$g_{m} = {(- 1)}^{n} \times 2 |PosFrac - 1 / 2|$; n = a positive integer which is either 1 (when MaxPos < MaxNeg) or 2 (when MaxPos ≥ MaxNeg).	Banerjee-Roy Coefficient
g_m*Avg_similarity	-	The product of g_m and Avg_similarity.
g_m*SD_similarity	-	The product of g_m and SD_similarity.
s_m¹	${s_{m}}^{1} = \frac{(MaxPos - MaxNeg)}{\max (MaxPos, MaxNeg)}$	Banerjee-Roy similarity coefficient 1
s_m²	${s_{m}}^{2} = \frac{PosAvgSim - NegAvgSim}{Avg_{similarity}}$	Banerjee-Roy similarity coefficient 2
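
For illustration, the first few measures of Table 2 can be computed as follows. The function names are hypothetical, and the sketch covers only n_eff, S_weighted, and SE. The weighted standard deviation is undefined when n_eff equals 1 (a single close source compound), a case this sketch does not handle.

```python
import math


def n_effective(weights):
    """Effective sample size for unequally weighted close source compounds."""
    return sum(weights) ** 2 / sum(w * w for w in weights)


def weighted_mean(values, weights):
    return sum(x * w for x, w in zip(values, weights)) / sum(weights)


def weighted_sd(values, weights):
    """SD_Activity (S_weighted) from Table 2."""
    mean_w = weighted_mean(values, weights)
    variance = sum(w * (x - mean_w) ** 2 for x, w in zip(values, weights)) / sum(weights)
    n_eff = n_effective(weights)
    return math.sqrt(variance * n_eff / (n_eff - 1))


def standard_error(values, weights):
    """Standard_Error (SE) from Table 2."""
    return weighted_sd(values, weights) / math.sqrt(n_effective(weights))


responses = [1.0, 2.0, 3.0, 4.0]
equal_weights = [1.0, 1.0, 1.0, 1.0]
skewed_weights = [0.5, 0.3, 0.2]
```

With equal weights, n_eff equals the number of close source compounds and S_weighted reduces to the ordinary sample standard deviation, which gives a convenient sanity check. With unequal weights, n_eff drops below the nominal count.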

2.1.3. Read-across optimization

This tool is also designed to perform read-across optimization, which helps to select the optimal hyperparameter values involved in the read-across prediction process. Hyperparameters are settings that control the prediction process but are set by the user rather than learned from the data. Hyperparameter optimization (or hyperparameter tuning) is the process of finding the set of hyperparameters that gives the best predictive performance. Since different hyperparameter values can lead to significantly different results, selecting optimal values helps the model generalize to new data. Hyperparameter optimization can therefore both reduce overfitting and improve predictive accuracy.

This tool performs read-across hyperparameter optimization using the k-fold cross-validation (kFoldCV) method. In this approach, the training set is divided into k equal-sized subsets (or folds), and the model is trained k times, each time using k-1 folds for training and the remaining fold for validation (Figure S1). During this process, every data point is used exactly once for validation, providing a more reliable estimate of performance. The results obtained from each iteration are averaged, which helps reduce overfitting and offers a robust evaluation of the model’s predictive ability. Here, kFoldCV is executed with various combinations of hyperparameters and CTCs across different similarity methods, and the best hyperparameter combination is selected based on the corresponding objective function. This tool utilizes various objective functions for optimization, depending on the response variable type: categorical (accuracy, F1-score, Cohen’s κ, and Matthews correlation coefficient) or continuous (mean absolute error, mean squared error, and root mean squared error).
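
The optimization loop can be sketched as a grid search scored by k-fold cross-validation. The example below is a simplified illustration, not the tool's code: it tunes only the CTC and the σ of a Gaussian kernel, uses the mean absolute error as the objective, assigns folds by interleaving indices without shuffling (an assumption), and assumes that the descriptors are already scaled. All function names are hypothetical.

```python
import math
from itertools import product


def gaussian_similarity(x, y, sigma):
    squared = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-squared / (2 * sigma ** 2))


def predict(target, sources, responses, n_ctc, sigma):
    """Similarity-weighted average over the n_ctc closest source compounds."""
    sims = [gaussian_similarity(target, s, sigma) for s in sources]
    close = sorted(range(len(sources)), key=lambda j: sims[j], reverse=True)[:n_ctc]
    total = sum(sims[j] for j in close)
    if total == 0.0:  # all similarities underflowed: fall back to a plain mean
        return sum(responses[j] for j in close) / len(close)
    return sum(sims[j] * responses[j] for j in close) / total


def kfold_mae(X, y, k, n_ctc, sigma):
    """Mean absolute error averaged over k folds."""
    fold_errors = []
    for fold in range(k):
        val = [i for i in range(len(X)) if i % k == fold]
        train = [i for i in range(len(X)) if i % k != fold]
        train_X = [X[i] for i in train]
        train_y = [y[i] for i in train]
        errors = [abs(predict(X[i], train_X, train_y, n_ctc, sigma) - y[i]) for i in val]
        fold_errors.append(sum(errors) / len(errors))
    return sum(fold_errors) / k


def optimize(X, y, k, ctc_grid, sigma_grid):
    """Return (best MAE, best CTC, best sigma) over the full grid."""
    scored = [(kfold_mae(X, y, k, c, s), c, s) for c, s in product(ctc_grid, sigma_grid)]
    return min(scored)


X = [[float(i)] for i in range(10)]  # toy training set with one descriptor
y = [2.0 * row[0] for row in X]
ctc_grid = [1, 2, 3]
sigma_grid = [0.5, 1.0, 2.0]
best_mae, best_ctc, best_sigma = optimize(X, y, 5, ctc_grid, sigma_grid)
```

For a categorical endpoint, the same loop would apply with the error replaced by a classification metric to be maximized, such as accuracy or the Matthews correlation coefficient.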

2.1.4. Read-Across feature importance analysis

In this tool, we have also developed a new methodology for determining the importance of the features used for read-across prediction. This feature importance method is primarily derived from the concept of the Shapley value [58], [59]. The Shapley value, introduced by Lloyd Shapley, is used in cooperative game theory to distribute a total payoff fairly among the players. The concept is also utilized in the field of Explainable AI for determining feature importance in machine learning models [60], [61], [62].

The determination of feature importance using the Shapley value is a prediction-oriented process, where feature importance is established by comparing predictions when a feature is included with predictions when that feature is omitted. This prediction difference is considered for every feature combination. The final feature importance is determined using Eq. 8:

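$${RA}_{IMP (i)} = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|! (|N| - |S| - 1)!}{|N|!} \left( WAP (S \cup \{i\}) - WAP (S) \right) \tag{8}$$

Here, ${RA}_{IMP (i)}$ = read-across feature importance for the i^th feature, N = the set of all features (|N| is the total number of features), S = a subset of features that does not contain the i^th feature (|S| is the number of features in that subset), $WAP (S \cup \{i\})$ = weighted average prediction obtained with the features in S plus the i^th feature, and $WAP (S)$ = weighted average prediction obtained with the features in S only, i.e., without the i^th feature.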

In this analysis, feature importance is determined solely using the training set (source compounds). The calculation can become computationally expensive as the number of features increases, because the number of feature combinations grows rapidly. To manage this complexity, we employ a threshold value (subset length) to limit the number of feature combinations; very small feature subsets are also excluded because they may lead to unreliable predictions. Feature importance can be assessed in two ways: locally (for a specific data point) and globally (overall importance). The algorithm for calculating local feature importance is illustrated in Fig. 3. The calculated local feature importance is represented as a bar plot (Figure S2a), where the size of each bar represents the importance of the feature, and the direction of its contribution is indicated by color: red for negative contributions and blue for positive contributions. Global importance is calculated by taking the mean of the absolute values of local importance across all data points. This global importance is represented in two types of plots: a column plot and a summary plot. In the column plot (Figure S2b), features are arranged along the X-axis based on their importance, from left to right. In the summary plot (Figure S2c), features are arranged along the Y-axis based on their importance, while the X-axis displays dots representing local importance values for each data point. These dots are colored according to the feature values to illustrate how importance changes with feature value.
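
A compact sketch of a Shapley-style importance calculation for read-across predictions is shown below. It is an illustration under several assumptions and does not reproduce the algorithm of Fig. 3: each training compound is predicted from the remaining training compounds (leave-one-out), the prediction for an empty feature subset is taken as the plain mean of the source responses, all feature subsets are enumerated (the subset-length threshold described above is not applied), and Euclidean distance converted with Eq. 3 is used as the similarity. All function names are hypothetical.

```python
import math
from itertools import combinations


def wap(target, sources, responses, features, n_ctc):
    """Weighted average prediction using only the listed feature indices."""
    if not features:
        # Assumption: with no features, fall back to the mean source response.
        return sum(responses) / len(responses)
    sims = []
    for s in sources:
        distance = math.sqrt(sum((target[f] - s[f]) ** 2 for f in features))
        sims.append(1.0 / (1.0 + distance))
    close = sorted(range(len(sources)), key=lambda j: sims[j], reverse=True)[:n_ctc]
    return sum(sims[j] * responses[j] for j in close) / sum(sims[j] for j in close)


def local_importance(query, X, y, n_ctc=2):
    """Shapley-style importance of every feature for training compound `query`."""
    sources = [row for j, row in enumerate(X) if j != query]
    responses = [val for j, val in enumerate(y) if j != query]
    n = len(X[0])
    importance = []
    for i in range(n):
        others = [f for f in range(n) if f != i]
        phi = 0.0
        for size in range(len(others) + 1):
            weight = math.factorial(size) * math.factorial(n - size - 1) / math.factorial(n)
            for subset in combinations(others, size):
                with_i = wap(X[query], sources, responses, subset + (i,), n_ctc)
                without_i = wap(X[query], sources, responses, subset, n_ctc)
                phi += weight * (with_i - without_i)
        importance.append(phi)
    return importance


def global_importance(X, y, n_ctc=2):
    """Mean absolute local importance across all training compounds."""
    local = [local_importance(q, X, y, n_ctc) for q in range(len(X))]
    return [sum(abs(row[i]) for row in local) / len(local) for i in range(len(X[0]))]


X = [[0.0, 5.0], [1.0, 1.0], [2.0, 4.0], [3.0, 0.0], [4.0, 3.0]]
y = [0.1, 1.0, 2.1, 2.9, 4.2]
phi = local_importance(0, X, y)
overall = global_importance(X, y)
```

A useful property of the exact Shapley weighting is that the local importances of a compound sum to the difference between the prediction with all features and the prediction with none, which can serve as a check on any implementation.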

Fig. 3. Algorithm for calculating local read-across feature importance.

3. Results and discussions

3.1. Implementation of the developed software

In this work, we have developed Python-based software that provides a user-friendly interface for performing four different tasks: similarity calculation, read-across prediction, read-across optimization, and read-across feature importance determination. The software can be accessed from the official website of the DTC Lab tools (https://sites.google.com/jadavpuruniversity.in/dtc-lab-software/home/intelligent-read-across). The graphical user interface (GUI) for the four different types of tasks is represented in Fig. 4. This tool allows users to perform tasks by providing Excel input files (.xlsx format) that contain information about the chemicals in the form of molecular descriptors. The output file is also generated in Excel format, containing the results of the corresponding tasks.

The GUI for the similarity calculation (Fig. 4a) requires two inputs: one for source compounds and another for target compounds, both of which contain structural information in the form of molecular descriptors. Users can choose the similarity methods from a dropdown menu, and a pair-wise similarity matrix is generated in the Excel file. Additionally, users can scale the input files using various methods, such as standardization or min-max normalization, before performing the similarity calculation.
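
As a rough illustration of this task, the sketch below standardizes two descriptor tables and builds a pairwise similarity matrix with target compounds in rows and source compounds in columns. It uses textbook forms of the Gaussian kernel, Laplacian kernel, and cosine similarity; the exact definitions and scaling conventions implemented in iRA are given in the methods section and may differ. All function names and the toy descriptor values are assumptions made for this example.

```python
import numpy as np

def standardize(source, target):
    """Scale both sets with the mean and standard deviation of the source set."""
    mu = source.mean(axis=0)
    sd = source.std(axis=0)
    sd[sd == 0] = 1.0  # avoid division by zero for constant descriptors
    return (source - mu) / sd, (target - mu) / sd

def gaussian_similarity(target, source, sigma=1.0):
    """exp(-||x - y||^2 / (2 sigma^2)) for every target-source pair."""
    d2 = ((target[:, None, :] - source[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def laplacian_similarity(target, source, gamma=1.0):
    """exp(-gamma * ||x - y||_1) for every target-source pair."""
    d1 = np.abs(target[:, None, :] - source[None, :, :]).sum(axis=2)
    return np.exp(-gamma * d1)

def cosine_similarity(target, source):
    """Cosine of the angle between descriptor vectors (range -1 to 1)."""
    num = target @ source.T
    den = (np.linalg.norm(target, axis=1)[:, None]
           * np.linalg.norm(source, axis=1)[None, :])
    return num / den

# Toy descriptor tables (rows: compounds, columns: descriptors).
source = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 9.0]])
target = np.array([[3.0, 4.0]])  # identical to the second source compound

src_s, tgt_s = standardize(source, target)
S_gauss = gaussian_similarity(tgt_s, src_s, sigma=1.0)
S_lap = laplacian_similarity(tgt_s, src_s, gamma=1.0)
S_cos = cosine_similarity(tgt_s, src_s)
```

In this toy case the target coincides with the second source compound, so all three measures return a similarity of 1 for that pair.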

The GUI for read-across prediction (Fig. 4b) requires two input files: one for the source compounds and one for the target compounds, both in Excel file format. Users select a similarity method for the read-across prediction from the dropdown menu, and the corresponding hyperparameter input is collected. Other required inputs include the similarity threshold and the number of CTCs. For automatic CTC selection in the read-across prediction, the CTC input box accepts “auto” as input. For categorical types of endpoints, a classification threshold is needed to categorize the predicted values into two groups. The output of the read-across prediction is provided in Excel file format, comprising three sheets. The first sheet contains predicted values, the second sheet includes calculated external validation metrics (if response values for the target compounds are available), and the third sheet lists various similarity and error measures.
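
The way these inputs interact can be sketched for a single target compound. The example assumes that the prediction is a similarity-weighted average of the responses of the retained close source compounds, which is a common read-across formulation; the prediction function actually used by iRA is defined in the methods section. The function name, the error message, and the numbers are hypothetical.

```python
import numpy as np

def read_across_predict(sim_row, y_source, n_ctc=10, sim_threshold=0.0):
    """Similarity-weighted average over the closest source compounds.

    sim_row: similarity of one target compound to every source compound.
    Source compounds below sim_threshold are discarded, then at most n_ctc
    of the remaining compounds with the highest similarity are used.
    """
    sim_row = np.asarray(sim_row, dtype=float)
    y_source = np.asarray(y_source, dtype=float)
    idx = np.where(sim_row >= sim_threshold)[0]
    idx = idx[np.argsort(sim_row[idx])[::-1][:n_ctc]]
    if idx.size < 2:
        raise ValueError("fewer than 2 source compounds left after filtering")
    weights = sim_row[idx]
    return float(np.dot(weights, y_source[idx]) / weights.sum()), idx

# Hypothetical similarities and responses for four source compounds.
sims = [0.9, 0.05, 0.6, 0.3]
y_src = [10.0, 50.0, 20.0, 40.0]

pred, used = read_across_predict(sims, y_src, n_ctc=2, sim_threshold=0.1)

# For a categorical endpoint, a classification threshold splits the
# predicted value into two groups (threshold value is illustrative).
label = int(pred >= 15.0)
```

Here the second source compound is removed by the threshold, the two most similar of the remaining compounds are kept, and the prediction is (0.9 × 10 + 0.6 × 20) / 1.5 = 14.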

The GUI for read-across optimization (Fig. 4c) uses only the source compounds (training set). Other inputs, such as similarity methods, hyperparameter settings, cross-validation folds, and the type of objective function, should also be provided. Default values are available in the respective input boxes. The objective function should be selected based on the type of response parameters. For continuous response values, different error-based objective functions are employed (e.g., MAE, MSE, RMSE), whereas for categorical response values, various classification-based metrics are utilized (e.g., accuracy, F1-score). The output results are generated in an Excel file, where each sheet contains the optimized results across different hyperparameters and CTC combinations, for each similarity method.
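
A grid search of this kind can be sketched as below for the Gaussian kernel with an MAE objective and five-fold cross-validation on the source compounds. It is a simplified illustration under several assumptions: the prediction is a similarity-weighted average, the auto-CTC option is omitted, settings that leave fewer than two source compounds for any held-out compound are penalized with an infinite error, and the data are randomly generated. None of the function names come from iRA.

```python
import itertools
import numpy as np

def gaussian_sim(targets, sources, sigma):
    d2 = ((targets[:, None, :] - sources[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def predict(sim_row, y_source, n_ctc, threshold):
    """Weighted average over the retained close source compounds."""
    idx = np.where(sim_row >= threshold)[0]
    idx = idx[np.argsort(sim_row[idx])[::-1][:n_ctc]]
    weights = sim_row[idx]
    if idx.size < 2 or weights.sum() == 0.0:
        return None  # setting not usable for this compound
    return float(np.dot(weights, y_source[idx]) / weights.sum())

def cv_mae(X, y, sigma, n_ctc, threshold, n_folds=5, seed=0):
    """Mean absolute error over all held-out compounds of a k-fold split."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    abs_errors = []
    for held_out in folds:
        train = np.setdiff1d(np.arange(len(y)), held_out)
        sims = gaussian_sim(X[held_out], X[train], sigma)
        for sim_row, i in zip(sims, held_out):
            pred = predict(sim_row, y[train], n_ctc, threshold)
            if pred is None:
                return float("inf")
            abs_errors.append(abs(pred - y[i]))
    return float(np.mean(abs_errors))

# Synthetic, already scaled training data (11 compounds, 2 descriptors).
rng = np.random.default_rng(42)
X = rng.normal(size=(11, 2))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=11)

sigmas = np.arange(0.25, 2.01, 0.25)   # 0.25, 0.50, ..., 2.00
ctcs = range(2, 11)                    # 2 to 10
thresholds = (0.0, 0.1, 0.2)

# One (MAE, sigma, CTC, threshold) tuple per grid point, best first.
results = sorted(
    (cv_mae(X, y, s, k, t), float(s), k, t)
    for s, k, t in itertools.product(sigmas, ctcs, thresholds)
)
best_mae, best_sigma, best_ctc, best_threshold = results[0]
```

Replacing the absolute error with a squared error, or with accuracy or F1-score computed on thresholded predictions, would give the other objective functions mentioned above.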

The GUI for determining the importance of the read-across feature (Fig. 4d) requires only the source compound set as input, which contains both structural information and response values. The plot type is selected according to the user's preference for accessing feature importance (global or local). Other inputs, including the similarity method, CTC, and hyperparameter settings, are provided based on the read-across prediction configurations. Additionally, a threshold input is necessary, representing the subset length of the feature combinations to reduce computational complexity. The generated plots are displayed in the plot section and also saved in the output folder. The calculated importance values are generated in the Excel files. Additionally, the user can access the local importance of the features for the target compounds by providing input that contains the structural information.

3.2. Case studies

In this study, we validated our developed tool using three different datasets collected from the literature. The first dataset was compiled from the work of Merugu et al. [63], who modeled the impact of multi-walled carbon nanotubes on lung pathologies and atherosclerosis using the kernel-weighted local polynomial regression (KwLPR) method. The second dataset was obtained from the work of Sifonte et al. [64], who developed a Nano-QSAR model to identify the toxicity exhibited by metal oxide nanoparticles towards human keratinocyte (HaCaT) cells. The third dataset was collected from the work of Huang et al. [65], who developed Nano-QSAR models to assess the inflammatory potential of metal oxide nanoparticles (MONPs). They constructed a dataset of 30 MONPs to screen the release of the proinflammatory cytokine interleukin-1 beta (IL-1β) in the THP-1 cell line. Although they developed two different types of models, we only used data from the regression-based model for the quantitative read-across predictions.

In this study, we utilized the same set of descriptors for read-across predictions and the same data set division as in the original studies to make a convenient comparison. All the collected datasets are detailed in Supplementary Material 2 (SM2). The “intelligent Read Across” tool used in this study is available in Supplementary Materials 3 (SM3).

3.2.1. Dataset 1

Dataset 1 was collected from the previous work of Merugu et al. [63], where they developed an AOP-anchored Nano-QSAR model to assess the lung pathologies and atherosclerosis potential shown by multiwalled carbon nanotubes. This dataset comprises a total of 14 nanoparticles (Table S1 of Supplementary Material 2), where 11 compounds are used as the training set for model construction, and the remaining three compounds serve as a test set for validating the model. Previously, the authors developed a QSAR model using aspect ratio and specific surface area of nanoparticles as independent variables. They applied the kernel-weighted local polynomial regression (KwLPR) to develop the model.

In this study, we employed the same training-test division and feature set for read-across predictions. Initially, the read-across prediction was performed across different similarity methods at the default hyperparameter settings. Here, various similarity measures demonstrate good external predictivity (Q²_F1 and Q²_F2 > 0.7), which includes the Chi-square kernel, Gaussian kernel, Laplacian kernel, Linear kernel, Polynomial kernel, Sigmoid kernel, and Cosine similarity methods. The external validation metrics for all these similarity measures are presented in Table S2 of the Supplementary Material 2 (SM2) file and also illustrated in Figure S3 of the Supplementary Material 1 (SM1). From this plot, we can see that the Gaussian kernel and Laplacian kernel yield superior results (Q²_F1 and Q²_F2 ∼ 0.9) compared to other similarity methods. For both the Gaussian kernel and the Laplacian kernel, prediction was performed with default hyperparameter settings, using a CTC value of 10 and σ (Gaussian kernel) and γ (Laplacian kernel) values of 1.
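
For reference, the external validation metrics reported in this section can be computed as in the sketch below, which uses the commonly cited definitions of Q²_F1, Q²_F2, Q²_F3, MAE, RMSEP, and the concordance correlation coefficient (CCC). The function name and the small numerical example are illustrative assumptions; the values are not taken from any of the three datasets.

```python
import numpy as np

def external_metrics(y_train, y_test, y_pred):
    y_train = np.asarray(y_train, dtype=float)
    y_test = np.asarray(y_test, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)

    press = np.sum((y_test - y_pred) ** 2)  # prediction error sum of squares
    # Q2_F1 references the training mean, Q2_F2 the test mean.
    q2_f1 = 1.0 - press / np.sum((y_test - y_train.mean()) ** 2)
    q2_f2 = 1.0 - press / np.sum((y_test - y_test.mean()) ** 2)
    # Q2_F3 scales the test error by the variance of the training responses.
    q2_f3 = 1.0 - (press / y_test.size) / (
        np.sum((y_train - y_train.mean()) ** 2) / y_train.size)

    mae = np.mean(np.abs(y_test - y_pred))
    rmsep = np.sqrt(press / y_test.size)

    # Concordance correlation coefficient between observed and predicted.
    dy = y_test - y_test.mean()
    dp = y_pred - y_pred.mean()
    ccc = 2.0 * np.sum(dy * dp) / (
        np.sum(dy ** 2) + np.sum(dp ** 2)
        + y_test.size * (y_test.mean() - y_pred.mean()) ** 2)

    return {"Q2_F1": q2_f1, "Q2_F2": q2_f2, "Q2_F3": q2_f3,
            "MAE": mae, "RMSEP": rmsep, "CCC": ccc}

# Made-up responses, only to show the arithmetic.
metrics = external_metrics(y_train=[1, 2, 3, 4, 5],
                           y_test=[2, 4],
                           y_pred=[2.5, 3.5])
```

With very small test sets, such as the three test compounds of Dataset 1, the denominators of Q²_F1 and Q²_F2 rest on only a few values, so these metrics are best read together with MAE and RMSEP.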

Here, we have performed a network analysis, generating a similarity network for both the Gaussian kernel (Fig. 5a) and the Laplacian kernel (Fig. 5b) methods. In these plots, the compounds in the test set serve as the source nodes (red color), while the compounds in the training set act as the target nodes (blue color). The thickness of the edges is adjusted based on the similarity values used in the read-across prediction. From these plots, we observe that most compounds in the training set display poor similarity with those in the test set, potentially leading to errors in the read-across predictions. We therefore applied the auto-CTC selection algorithm to both the Gaussian kernel and Laplacian kernel methods. Both methods show enhanced predictive power (Table S3 of Supplementary Material 2) compared with the default CTC value. This improvement occurs due to the target-specific selection of CTCs, which ensures that source compounds with poor similarity are excluded from the read-across prediction. Of the two, the Gaussian kernel method yields the better results. Additionally, we generated the similarity networks (Figs. 5c and 5d) for the compounds involved in the auto-CTC selection. From these plots, we observe that for compound 12 (NRCWE-048), only two compounds are utilized for read-across prediction: compound 2 (NRCWE-026) and compound 3 (NRCWE-006), for both types of similarity methods. Similarly, for compound 14 (NRCWE-064), only two compounds are used for read-across prediction: compound 8 (NRCWE-062) and compound 11 (NRCWE-063). However, for compound 13 (NRCWE-047), seven compounds (NRCWE-049, NRCWE-043, NRCWE-046, NRCWE-045, NRCWE-062, NRCWE-061, and NRCWE-044) were utilized for prediction with the Gaussian kernel, and six compounds (NRCWE-049, NRCWE-043, NRCWE-046, NRCWE-045, NRCWE-061, and NRCWE-044) were used with the Laplacian kernel. We also note that compound 1 (NM 401) does not appear in the predictions for any of the methods.

Next, we conducted read-across optimization for both the Gaussian kernel and Laplacian kernel methods using the training set. The optimization utilized the MAE-based objective function combined with a five-fold cross-validation method. We optimized the parameters by selecting σ and γ values ranging from 0.25 to 2 in increments of 0.25, CTC values from 2 to 10, including the auto setting, and similarity threshold values of 0.0, 0.1, and 0.2. Of the two similarity methods, the Gaussian kernel gives the lower MAE values. The optimization result for the Gaussian kernel method is shown in Table S4 of the Supplementary Material 2 (SM2). The optimal settings (CTC = 4, similarity threshold = 0.1, σ = 0.75) were then applied for the read-across prediction. Although the number of CTCs is set to 4, the number of source compounds actually used is not the same for all the test compounds. This inconsistency arises from the similarity threshold of 0.1, which removes close source compounds with a similarity lower than 0.1. The compounds used in the prediction and their corresponding similarity values are indicated in Figure S4 of the Supplementary Material 1 (SM1). Here, the external validation metrics indicate improvement (Q²_F1 and Q²_F2 ∼ 0.97) compared with both the default hyperparameter setting and the auto-CTC selection algorithm. These results are presented in Table 3 alongside the results obtained with the default settings and those from the auto-CTC selection algorithm, for the Gaussian kernel method.

Table 3.

External validation metrics of read-across prediction with Gaussian kernel method, at different conditions (Dataset 1).

Method	Q²_F1	Q²_F2	Q²_F3	MAE_test	RMSEP	CCC
Default setting	0.904	0.897	0.917	1.093	1.387	0.944
Auto CTC selection	0.959	0.956	0.964	0.851	0.909	0.979
Optimized settings (CTC = 4, similarity threshold = 0.1, σ = 0.75)	0.976	0.974	0.979	0.557	0.695	0.988


Finally, we have also conducted feature importance analysis to determine the relative significance of the features involved in the read-across prediction. Here, feature importance analysis was carried out using the optimized hyperparameter settings. Feature importance is assessed in two ways: global feature importance is calculated using the training set, while local feature importance is determined for the test set compounds. The global feature importance is illustrated in Figs. 6a and 6b, where the aspect ratio of the nanoparticles is shown to have greater significance than the specific surface area in terms of toxicity. The local feature importance plots for compounds 12 (NRCWE-048) (Fig. 6c) and 13 (NRCWE-047) (Fig. 6d) show that the aspect ratio is more important; however, for compound 12 it displays a negative contribution, whereas for compound 13 it is positive. For compound 14 (NRCWE-064) (Fig. 6e), both the aspect ratio and specific surface area contribute similarly in a positive direction. This analysis leads us to conclude that the aspect ratio of a nanoparticle has a greater influence on lung pathologies and atherosclerosis than the specific surface area.

Fig. 6 — Read-across feature importance (Dataset 1). (a) Global summary plot. (b) Global Column plot. (c) Local bar plot for compound 12. (d) Local bar plot for compound 13. (e) Local bar plot for compound 14.

3.2.2. Dataset 2

Dataset 2 was collected from the previous work of Sifonte et al. [64] and is presented in Table S5 of the Supplementary Material 2 (SM2) file. This dataset comprises 16 nanoparticles, with eleven compounds used as the training set and the remaining five as the test set. In their study, they developed a Nano-QSAR model to determine the toxicity of nanoparticles to human keratinocyte (HaCaT) cells. For the development of the model, the previous authors used two quantum-mechanical descriptors: the standard enthalpy of formation of the metal oxide nanocluster (ΔH^c_f) and the absolute value of the Fermi energy of the cluster (∈^c_Fermi).

In this study, we used the same set of descriptors and the same training-test division for the read-across prediction. Here, read-across prediction was performed using the default hyperparameter settings across various similarity measures. Among these, cosine similarity, Gaussian kernel, Laplacian kernel, polynomial kernel, and sigmoid kernel demonstrated acceptable results (Q²_F1 and Q²_F2 > 0.6). However, cosine similarity outperformed the other methods, with Q²_F1 at 0.943 and Q²_F2 at 0.916. These external validation metrics for the predictions are also represented in Table S6 of Supplementary Material 2 (SM2) and Figure S5 of Supplementary Material 1 (SM1).

Although we have applied the default hyperparameter settings here, the read-across predictions using the cosine similarity measure were not conducted at the default CTC (i.e., 10). This occurred due to the differing value range of the descriptors (negative values), which results in similarity values that fall on the negative scale. Since the default similarity threshold is zero, this causes the elimination of source compounds with negative similarity values. The compounds involved in the read-across prediction, along with their similarity values, are represented in the similarity network (Fig. 7a). From this plot, we observe that for compounds 12 (Y₂O₃), 13 (V₂O₃), and 14 (Bi₂O₃), only seven compounds are used for the prediction. In contrast, for compounds 15 (La₂O₃) and 16 (In₂O₃), only four compounds are used for the prediction. Notably, compounds 1 (TiO₂) and 9 (Sb₂O₃) were absent from the predictions for any test set compounds. This plot also shows that all the training compounds share a significant level of similarity with the test set compounds. Therefore, when we applied the auto-CTC selection algorithm using the cosine similarity method, it resulted in a decrease in performance (Q²_F1 = 0.826 and Q²_F2 = 0.741) due to the removal of training compounds with higher similarity values. We also generated a similarity network (Fig. 7b) for the compounds involved in the auto-CTC selection algorithm. This plot indicates that compounds 4 (ZnO) and 8 (CoO) were involved in the prediction of compounds 15 and 16; compounds 5 (Al₂O₃) and 10 (Cr₂O₃) were involved in the prediction of compounds 12 and 13; while compounds 3 (SnO₂), 7 (NiO), 8 (CoO), and 11 (ZrO₂) were involved in the prediction of compound 14. The pair-wise similarity values involved in the read-across prediction are also provided in Table S7 of the Supplementary Material 2 (SM2) file.
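
The effect described here can be reproduced with a small numerical sketch: when descriptor vectors contain negative values, some cosine similarities become negative, and a similarity threshold of zero then leaves fewer source compounds than the requested number of CTCs. The descriptor values below are invented for illustration and are not taken from Dataset 2.

```python
import numpy as np

def cosine_row(target, sources):
    """Cosine similarity of one target vector to every source vector."""
    num = sources @ target
    den = np.linalg.norm(sources, axis=1) * np.linalg.norm(target)
    return num / den

# Hypothetical scaled descriptors; negative entries are allowed.
target = np.array([-2.0, 1.0])
sources = np.array([[-1.0, 0.5],   # same direction as the target
                    [1.0, 1.0],
                    [2.0, -1.0],   # opposite direction
                    [0.5, 2.0]])

sims = cosine_row(target, sources)

n_ctc = 10            # requested number of close source compounds
sim_threshold = 0.0   # default threshold

# Source compounds with negative similarity are eliminated first.
kept = np.where(sims >= sim_threshold)[0]
n_used = min(n_ctc, kept.size)
```

Only two of the four hypothetical source compounds survive the threshold, so the prediction would be made with two compounds even though the CTC input is 10.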

Fig. 7 — Similarity network (Dataset 2). (a) Similarity network for the cosine similarity method. (b) Similarity network for the cosine similarity method with auto-CTC selection.

Here, we also performed read-across optimization for the cosine similarity method using the MAE-based objective function along with a five-fold cross-validation approach. The optimization was carried out by varying the CTC values from 2 to 10, alongside the auto setting, and using similarity thresholds of 0.0, 0.1, and 0.2. The results of the optimization are shown in Table S8 of Supplementary Material 2 (SM2). We applied the optimization settings (CTC = 5, Similarity Threshold = 0.1) for predicting the original test set compounds. Although CTC value five is utilized, the introduction of a similarity threshold of 0.1 results in the elimination of some source compounds. Additionally, the similarity network (Figure S6) was generated here to depict the compounds involved in the prediction. The external validation metrics at the optimized setting demonstrate a slight reduction compared to the default setting, which can be attributed to the removal of certain source compounds due to lower similarity. The external validation metrics at the optimized setting are presented in Table 4, alongside results from the default setting and the results using the auto CTC selection algorithm.

Table 4.

External validation metrics of read-across prediction with the Cosine similarity method, at different conditions (Dataset 2).

Method	Q²_F1	Q²_F2	Q²_F3	MAE_test	RMSEP	CCC
Default setting	0.943	0.916	0.961	0.067	0.088	0.948
Auto CTC selection	0.826	0.741	0.880	0.138	0.153	0.921
Optimized settings (CTC = 5, similarity threshold = 0.1)	0.926	0.891	0.949	0.090	0.100	0.929


A feature importance analysis was performed to assess the relative importance of the features. This analysis utilized the training set to determine the global importance and the test set to evaluate local importance, using the default hyperparameter settings. The global feature importance is illustrated in Figs. 8a and 8b, which demonstrate that the Fermi energy of the nanocluster is more important than the formation enthalpy in determining toxicity towards human keratinocyte (HaCaT) cells, a finding consistent with previous studies. The local feature importance, depicted in Figs. 8c through 8g, indicates that the Fermi energy consistently holds greater importance than the formation enthalpy for all compounds. However, for compounds 12 and 13, the Fermi energy has a negative contribution, while for compounds 14, 15, and 16, it contributes positively to toxicity.

Fig. 8 — Read-across feature importance analysis (Dataset 2). (a) Global summary plot. (b) Global Column plot. (c) Local bar plot for compound 12. (d) Local bar plot for compound 13. (e) Local bar plot for compound 14. (f) Local bar plot for compound 15. (g) Local bar plot for compound 16.

3.2.3. Dataset 3

Dataset 3 was collected from the previous work of Huang et al. [65]. In their study, the authors developed a Nano-QSAR model to predict the inflammatory potential of MONPs. They built a dataset of 30 different NPs and screened them for the release of the proinflammatory cytokine interleukin-1 beta (IL-1β) in the THP-1 cell line. They created two types of models: one with a categorical response variable and the other with a continuous response variable. However, we only used the continuous data for the read-across prediction. For developing the continuous model, they used only 26 NPs since four out of 30 NPs exhibited outlier behavior. They randomly divided the dataset into two parts; 20 compounds were used for training the model, while the remaining six compounds were used for validation. Their final model consists of three descriptors: χ_me (electronegativity of metal atoms), ζ-potential, and D_water (hydrodynamic diameter). Here, we used the same set of descriptors and the same training-test split for the read-across prediction. The collected data are presented in Table S9 of the Supplementary Material 2 (SM2) file.

Initially, read-across predictions were conducted using the default hyperparameter settings, and various methods yielded promising results, as shown in Figure S7 of Supplementary Material 1 (SM1) and Table S10 of Supplementary Material 2 (SM2). All models demonstrated good performance, with Q²_F1 and Q²_F2 values greater than 0.75. Among these, the Gaussian kernel similarity method achieved the best results, with Q²_F1 = 0.928 and Q²_F2 = 0.927. The Laplacian kernel, linear kernel, and polynomial kernel similarity methods produced comparable results. Additionally, we applied the auto-CTC selection algorithm in conjunction with the Gaussian kernel method, resulting in improved external validation metrics. This enhancement may result from the exclusion of source compounds with lower similarity values. This is further illustrated by the similarity networks presented in Fig. 9, generated using both the default settings and the auto-CTC selection algorithm. The pairwise similarity values used in the read-across predictions are detailed in Table S11 of Supplementary Material 2 (SM2).

Fig. 9 — Similarity network (Dataset 3). (a) Similarity network for the Gaussian kernel similarity method. (b) Similarity network for the Gaussian kernel method with the auto CTC selection method.

We have also performed the read-across optimization using the Gaussian kernel similarity method. The optimization was performed by varying CTC values from 2 to 10 (along with the auto setting), σ values from 0.25 to 2 in increments of 0.25, and similarity thresholds of 0.0, 0.1, and 0.2. However, applying the optimal settings resulted in a "compound < 2" error, because the similarity threshold of 0.1 removed the source compounds with a similarity lower than 0.1. Therefore, the next-best settings (CTC = 2, similarity threshold = 0.0, σ = 2.0) (Table S12 of Supplementary Material 2 (SM2)) were chosen for the read-across prediction of the original test set compounds. The read-across prediction with these optimized settings resulted in a slight reduction in the external validation metrics (Q²_F1 = 0.923 and Q²_F2 = 0.922) compared with the default settings. The external validation metrics for the read-across prediction, using the default settings, the auto-CTC selection algorithm, and the optimized settings, are presented in Table 5. Additionally, we generated a similarity network for the compounds involved in the prediction, which is illustrated in Figure S8 of Supplementary Material 1 (SM1).

Table 5.

External validation metrics of read-across prediction with Gaussian kernel method, at different conditions (Dataset 3).

Method	Q²_F1	Q²_F2	Q²_F3	MAE_test	RMSEP	CCC
Default setting	0.928	0.927	0.921	3.600	4.417	0.960
Auto CTC selection	0.946	0.946	0.941	2.655	3.816	0.971
Optimized settings (CTC = 2, similarity threshold = 0.0, σ = 2.0)	0.923	0.922	0.915	3.034	4.581	0.961


Finally, we performed the read-across feature importance analysis to determine the contribution of the features to the response parameters. The feature importance analysis was conducted with default settings (similarity threshold = 0.0, σ = 1.0) using the auto-CTC selection algorithm. Fig. 10 represents the global feature importance, while Figure S9 of Supplementary Material 1 (SM1) shows the local importance for the test set compounds. The global plots indicate that χ_me has the highest importance, whereas ζ-potential has the least importance.

Fig. 10 — Read-across feature importance (Dataset 3). (a) Global summary plot. (b) Global Column plot.

3.3. Comparison with previous studies

This section compares the findings of the current research with those of previous studies. The comparison is based on the root mean squared error of prediction (RMSEP), as it was the external validation metric commonly reported by the authors of the previous studies. The analysis was conducted using our best results, which are detailed in Table 6. For all three datasets, the RMSEP obtained with our similarity-based read-across algorithm is lower than that reported for the earlier Nano-QSAR models. Therefore, we conclude that this similarity-based algorithm can be used for enhancing the predictive accuracy of toxicity assessments for new NPs.

Table 6.

Comparison between previous studies and current studies.

Authors	RMSEP (previous study)	RMSEP (current study)
Merugu et al. [63]	1.080	0.695
Sifonte et al. [64]	0.180	0.088
Huang et al. [65]	7.390	3.816
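
To put the values in Table 6 on a common scale, the relative reduction in RMSEP can be computed directly from the table. The short sketch below does only this arithmetic; the dictionary layout and variable names are illustrative.

```python
# RMSEP values copied from Table 6: (previous study, current study).
rmsep = {
    "Merugu et al. [63]": (1.080, 0.695),
    "Sifonte et al. [64]": (0.180, 0.088),
    "Huang et al. [65]": (7.390, 3.816),
}

# Relative reduction in percent: 100 * (previous - current) / previous.
reduction = {
    study: round(100.0 * (prev - curr) / prev, 1)
    for study, (prev, curr) in rmsep.items()
}

for study, pct in reduction.items():
    print(f"{study}: {pct}% lower RMSEP")
```

This gives reductions of roughly 36%, 51%, and 48% for the three datasets. Since the test sets contain only three to six compounds, these percentages describe the reported values rather than a statistical test of the difference.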


4. Conclusion

The widespread use of nanoparticles presents greater risks to human health and other species. Their small size allows them to infiltrate biological systems and cause adverse effects, such as cytotoxicity, by binding to intracellular substances. Therefore, it is crucial to evaluate the potential toxicity of nanoparticles. However, experimental determination of toxicity involves ethical considerations and requires a significant amount of time and resources. Thus, an alternative method needs to be established. Computational methods have emerged as a promising alternative to experimental procedures. In this study, we developed a Python-based tool called “intelligent Read Across” (iRA), which provides predictions using a similarity-based read-across algorithm. Additionally, this tool can perform other tasks, such as pair-wise similarity calculations, read-across optimization, and read-across feature importance analysis. The similarity calculation is performed based on calculated molecular descriptors that describe the closeness of the compounds. Read-across optimization involves selecting optimal values for the hyperparameters present in different similarity measures. The read-across feature importance analysis determines the relative importance of the features involved in read-across predictions, following the basic principles of SHAP feature importance. This tool also implements a new auto-CTC selection algorithm for read-across predictions, which selects source compounds in a compound-specific manner. This algorithm can enhance the performance of read-across predictions by eliminating source compounds with lower similarity, which could otherwise increase errors. This tool was validated using three high-quality small (≤ 30) toxicity datasets. For all datasets, our read-across algorithms demonstrate better predictive performance compared to previous studies. 
Dataset 1 consists of 11 training compounds and 3 test set compounds, yielding external validation metrics like Q²_F1 = 0.976 and Q²_F2 = 0.974. Dataset 2 comprises 11 training compounds and 5 test set compounds, with external validation metrics of Q²_F1 = 0.943 and Q²_F2 = 0.916. Dataset 3 includes 20 training compounds and 6 test set compounds, achieving external validation metrics of Q²_F1 = 0.946 and Q²_F2 = 0.946. These results suggest that our similarity-based read-across algorithm can provide predictions with higher accuracy. Feature importance analysis for each dataset revealed that the aspect ratio of NPs in dataset 1, Fermi energy of the nanocluster in dataset 2, and χ_me (electronegativity of metal atoms) in dataset 3 have the highest importance among other descriptors. Thus, this tool can primarily serve two main purposes: (1) to characterize the structure and properties involved in toxicity and (2) to accurately predict the toxic potential of data-poor NPs. Although we have demonstrated the application of our tool for the nanotoxicity predictions of small data sets, this platform can perform predictions on any type of dataset, regardless of its size and chemistry, as long as the descriptors are available.

CRediT authorship contribution statement

Kunal Roy: Writing – review & editing, Supervision, Resources, Project administration, Conceptualization. Souvik Pore: Writing – original draft, Validation, Software, Investigation, Formal analysis, Data curation.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Footnotes

Appendix A. Supplementary data associated with this article can be found in the online version at doi:10.1016/j.csbj.2025.07.032.

Appendix A. Supplementary material

Supplementary material: mmc1.docx (657.5 KB, docx)

Supplementary material: mmc2.xlsx (48.7 KB, xlsx)

Supplementary material: mmc3.zip (2.6 MB, zip)

Data availability

Additional data are provided in Supplementary Materials 1 (SM1) and 2 (SM2). The “intelligent Read Across” tool used in this study is available in Supplementary Materials 3 (SM3).

References

1.Teow Y., Asharani P.V., Hande M.P., Valiyaveettil S. Health impact and safety of engineered nanomaterials. Chem Commun. 2011;47:7025–7038. doi: 10.1039/C0CC05271J. [DOI] [PubMed] [Google Scholar]
2.Mu Y., Wu F., Zhao Q., Ji R., Qie Y., Zhou Y., et al. Predicting toxic potencies of metal oxide nanoparticles by means of nano-QSARs. Nanotoxicology. 2016;10:1207–1214. doi: 10.1080/17435390.2016.1202352. [DOI] [PubMed] [Google Scholar]
3.Mikolajczyk A., Gajewicz A., Mulkiewicz E., Rasulev B., Marchelek M., Diak M., et al. Nano-QSAR modeling for ecosafe design of heterogeneous TiO2-based nano-photocatalysts. Environ Sci Nano. 2018;5:1150–1160. doi: 10.1039/C8EN00085A. [DOI] [Google Scholar]
4.Wongkaew N., Simsek M., Griesche C., Baeumner A.J. Functional nanomaterials and nanostructures enhancing electrochemical biosensors and lab-on-a-chip performances: recent progress, applications, and future perspective. Chem Rev. 2019;119:120–194. doi: 10.1021/acs.chemrev.8b00172. [DOI] [PubMed] [Google Scholar]
5.Zhang Y., Zhang Y., Akakuru O.U., Xu X., Wu A. Research progress and mechanism of nanomaterials-mediated in-situ remediation of cadmium-contaminated soil: A critical review. J Environ Sci. 2021;104:351–364. doi: 10.1016/J.JES.2020.12.021. [DOI] [PubMed] [Google Scholar]
6.Lu T., Cui J., Qu Q., Wang Y., Zhang J., Xiong R., et al. Multistructured electrospun nanofibers for air filtration: a review. ACS Appl Mater Interfaces. 2021;13:23293–23313. doi: 10.1021/acsami.1c06520. [DOI] [PubMed] [Google Scholar]
7.Bocca B., Caimi S., Senofonte O., Alimonti A., Petrucci F. ICP-MS based methods to characterize nanoparticles of TiO2 and ZnO in sunscreens with focus on regulatory and safety issues. Sci Total Environ. 2018;630:922–930. doi: 10.1016/J.SCITOTENV.2018.02.166. [DOI] [PubMed] [Google Scholar]
8.Bajpai V.K., Shukla S., Kang S.M., Hwang S.K., Song X., Huh Y.S., et al. Developments of cyanobacteria for nano-marine drugs: relevance of nanoformulations in cancer therapies. Mar Drugs. 2018;16:179. doi: 10.3390/MD16060179. [DOI] [PMC free article] [PubMed] [Google Scholar]
9.Solano R., Patiño-Ruiz D., Herrera A. Preparation of modified paints with nano-structured additives and its potential applications. Nanomater Nanotechnol. 2020;10 doi: 10.1177/184798042090918. [DOI] [Google Scholar]
10.Cheng K., Pan Y., Yuan B. Cytotoxicity prediction of nano metal oxides on different lung cells via Nano-QSAR. Environ Pollut. 2024;344 doi: 10.1016/J.ENVPOL.2024.123405. [DOI] [PubMed] [Google Scholar]
11.Zhang H., Ji Z., Xia T., Meng H., Low-Kam C., Liu R., et al. Use of metal oxide nanoparticle band gap to develop a predictive paradigm for oxidative stress and acute pulmonary inflammation. ACS Nano. 2012;6:4349–4368. doi: 10.1021/nn3010087. [DOI] [PMC free article] [PubMed] [Google Scholar]
12.Meng H., Xia T., George S., Nel A.E. A predictive toxicological paradigm for the safety assessment of nanomaterials. ACS Nano. 2009;3(7):1620. doi: 10.1021/NN9005973. [DOI] [PubMed] [Google Scholar]
13.Fröhlich E. Comparison of conventional and advanced in vitro models in the toxicity testing of nanoparticles. Artif Cells Nanomed Biotechnol. 2018;46:1091–1107. doi: 10.1080/21691401.2018.1479709. [DOI] [PMC free article] [PubMed] [Google Scholar]
14.Banerjee A., Kar S., Pore S., Roy K. Efficient predictions of cytotoxicity of TiO2-based multi-component nanoparticles using a machine learning-based q-RASAR approach. Nanotoxicology. 2023;17:78–93. doi: 10.1080/17435390.2023.2186280. [DOI] [PubMed] [Google Scholar]
15.Roy J., Pore S., Roy K. Prediction of cytotoxicity of heavy metals adsorbed on nano-TiO2 with periodic table descriptors using machine learning approaches. Beilstein J Nanotechnol. 2023;14:939–950. doi: 10.3762/BJNANO.14.77. 14:77. [DOI] [PMC free article] [PubMed] [Google Scholar]
16.U.S. Environmental Protection Agency. Strategic Plan to Promote the Development and Implementation of Alternative Test Methods Within the TSCA Program. 2018.
17.European Chemicals Agency. New approach methodologies in regulatory science. European Chemicals Agency; 2016. https://doi.org/doi/10.2823/543644.
18. Roy J., Roy K. Nano-read-across predictions of toxicity of metal oxide engineered nanoparticles (MeOx ENPS) used in nanopesticides to BEAS-2B and RAW 264.7 cells. Nanotoxicology. 2022;16:629–644. doi: 10.1080/17435390.2022.2132887.
19. Chatterjee M., Banerjee A., De P., Gajewicz-Skretna A., Roy K. A novel quantitative read-across tool designed purposefully to fill the existing gaps in nanosafety data. Environ Sci Nano. 2022;9:189–203. doi: 10.1039/D1EN00725D.
20. Ambure P., Gajewicz-Skretna A., Cordeiro M.N.D.S., Roy K. New workflow for QSAR model development from small data sets: small dataset curator and small dataset modeler. Integration of data curation, exhaustive double cross-validation, and a set of optimal model selection techniques. J Chem Inf Model. 2019;59:4070–4076. doi: 10.1021/ACS.JCIM.9B00476.
21. Schultz T.W., Amcoff P., Berggren E., Gautier F., Klaric M., Knight D.J., et al. A strategy for structuring and reporting a read-across prediction of toxicity. Regul Toxicol Pharm. 2015;72:586–601. doi: 10.1016/J.YRTPH.2015.05.016.
22. OECD. Guidance on grouping of chemicals, OECD Series on Testing and Assessment, No. 194. 2nd ed. OECD Publishing; Paris: 2017.
23. Gini G., Franchi A.M., Manganaro A., Golbamaki A., Benfenati E. ToxRead: A tool to assist in read across and its use to assess mutagenicity of chemicals. SAR QSAR Environ Res. 2014;25:999–1011. doi: 10.1080/1062936X.2014.976267.
24. Manganelli S., Benfenati E. Use of read-across tools. In: Benfenati E., editor. In Silico Methods for Predicting Drug Toxicity. Methods in Molecular Biology. Springer; 2016. pp. 305–322.
25. Patlewicz G., Helman G., Pradeep P., Shah I. Navigating through the minefield of read-across tools: A review of in silico tools for grouping. Comput Toxicol. 2017;3:1–18. doi: 10.1016/J.COMTOX.2017.05.003.
26. QSAR Toolbox, 2025. https://qsartoolbox.org/ (accessed May 18, 2025).
27. Patlewicz G., Jeliazkova N., Gallegos Saliner A., Worth A.P. Toxmatch - A new software tool to aid in the development and evaluation of chemically similar groups. SAR QSAR Environ Res. 2008;19:397–412. doi: 10.1080/10629360802083848.
28. Analog Identification Methodology (AIM) Tool | US EPA n.d. https://www.epa.gov/tsca-screening-tools/analog-identification-methodology-aim-tool (accessed May 18, 2025).
29. Varsou D.D., Sarimveis H. Apellis: an online tool for read-across model development. Comput Toxicol. 2021;17. doi: 10.1016/J.COMTOX.2020.100146.
30. Low Y., Sedykh A., Fourches D., Golbraikh A., Whelan M., Rusyn I., et al. Integrative chemical-biological read-across approach for chemical hazard classification. Chem Res Toxicol. 2013;26:1199–1208. doi: 10.1021/TX400110F.
31. Russo D.P., Kim M.T., Wang W., Pinolini D., Shende S., Strickland J., et al. CIIPro: a new read-across portal to fill data gaps using public large-scale chemical and biological data. Bioinformatics. 2017;33:464–466. doi: 10.1093/BIOINFORMATICS/BTW640.
32. Helman G., Shah I., Williams A.J., Edwards J., Dunne J., Patlewicz G. Generalized Read-Across (GenRA): A workflow implemented into the EPA CompTox Chemicals Dashboard. ALTEX. 2019;36:462–465. doi: 10.14573/ALTEX.1811292.
33. AMBIT – Cheminformatics data management system – Cefic-Lri n.d. https://cefic-lri.org/toolbox/ambit/ (accessed May 18, 2025).
34. Banerjee A., Chatterjee M., De P., Roy K. Quantitative predictions from chemical read-across and their confidence measures. Chemom Intell Lab Syst. 2022;227. doi: 10.1016/J.CHEMOLAB.2022.104613.
35. Maggiora G., Vogt M., Stumpfe D., Bajorath J. Molecular similarity in medicinal chemistry. J Med Chem. 2014;57:3186–3204. doi: 10.1021/JM401411Z.
36. Raymond J.W., Willett P. Maximum common subgraph isomorphism algorithms for the matching of chemical structures. J Comput Aided Mol Des. 2002;16:521–533. doi: 10.1023/A:1021271615909.
37. Good A.C., Richards W.G. Explicit calculation of 3D molecular similarity. Perspect Drug Discov Des. 1998;9(0):321–338. doi: 10.1023/A:1027280526177.
38. Petrone P.M., Simms B., Nigsch F., Lounkine E., Kutchukian P., Cornett A., et al. Rethinking molecular similarity: comparing compounds on the basis of biological activity. ACS Chem Biol. 2012;7:1399–1409. doi: 10.1021/CB3001028.
39. Todeschini R., Ballabio D., Consonni V. Distances and other dissimilarity measures in chemometrics. Encycl Anal Chem. 2015:1–34. doi: 10.1002/9780470027318.a9438.
40. Urenda J.C., Kosheleva O., Kreinovich V. Why 1/(1+d) is an effective distance-based similarity measure: two explanations. IEEE 11th International Conference on Intelligent Systems. 2022.
41. Willett P., Barnard J.M., Downs G.M. Chemical similarity searching. J Chem Inf Comput Sci. 1998;38:983–996. doi: 10.1021/CI9800211.
42. Cereto-Massagué A., Ojeda M.J., Valls C., Mulero M., Garcia-Vallvé S., Pujadas G. Molecular fingerprint similarity search in virtual screening. Methods. 2015;71:58–63. doi: 10.1016/J.YMETH.2014.08.005.
43. DistanceMetric — scikit-learn 1.6.1 documentation n.d. https://scikit-learn.org/stable/modules/generated/sklearn.metrics.DistanceMetric.html (accessed May 18, 2025).
44. Bajusz D., Rácz A., Héberger K. Why is Tanimoto index an appropriate choice for fingerprint-based similarity calculations? J Cheminform. 2015;7:1–13. doi: 10.1186/S13321-015-0069-3.
45. Tabak J. Geometry: the language of space and form. Infobase Publishing; 2014.
46. Willett P., Barnard J.M., Downs G.M. Chemical similarity searching. J Chem Inf Comput Sci. 1998;38:983–996. doi: 10.1021/CI9800211.
47. Rodrigues O. Combining Minkowski and Chebyshev: new distance proposal and survey of distance metrics using k-nearest neighbours classifier. Pattern Recognit Lett. 2018;110:66–71. doi: 10.1016/J.PATREC.2018.03.021.
48. Foundations of Metric Space Searching. In: Similarity Search: The Metric Space Approach. Advances in Database Systems. 2006;32:5–66. doi: 10.1007/0-387-29151-2_1.
49. De Maesschalck R., Jouan-Rimbaud D., Massart D.L. The Mahalanobis distance. Chemom Intell Lab Syst. 2000;50:1–18. doi: 10.1016/S0169-7439(99)00047-7.
50. Mahalanobis P.C. On the generalized distance in statistics. Sankhyā. Indian J Stat Ser A (2008) 2018;80:1–7. (S)
51. Nagar S., Khunteta A. A Proposed Modification Over Learning Vector Quantization and K-Means Algorithms for Performance Enhancement. Proceedings of the International Conference on Recent Cognizance in Wireless Communication & Image Processing. 2016. pp. 671–682.
52. Ricotta C., Podani J. On some properties of the Bray-Curtis dissimilarity and their ecological meaning. Ecol Complex. 2017;31:201–205. doi: 10.1016/J.ECOCOM.2017.07.003.
53. Chen J., Wang C., Sun Y., Shen X. Semi-supervised Laplacian regularized least squares algorithm for localization in wireless sensor networks. Comput Netw. 2011;55:2481–2491. doi: 10.1016/J.COMNET.2011.04.010.
54. Boser B.E., Guyon I.M., Vapnik V.N. A training algorithm for optimal margin classifiers. Proceedings of the Fifth Annual Workshop on Computational Learning Theory. 1992. pp. 144–152.
55. Zhou D.X., Jetter K. Approximation with polynomial kernels and SVM classifiers. Adv Comput Math. 2006;25:323–344. doi: 10.1007/S10444-004-7206-2.
56. Camps-Valls G., Martín-Guerrero J.D., Rojo-Álvarez J.L., Soria-Olivas E. Fuzzy sigmoid kernel for support vector classifiers. Neurocomputing. 2004;62:501–506. doi: 10.1016/J.NEUCOM.2004.07.004.
57. Sarwar B., Karypis G., Konstan J., Riedl J. Item-based collaborative filtering recommendation algorithms. Proceedings of the 10th International Conference on World Wide Web. 2001. pp. 285–295.
58. Shapley L.S. A value for n-person games. Contributions to the theory of games II. Class Game Theory. 1953;1953:31–40. doi: 10.1515/9781400829156-012.
59. Lundberg S.M., Lee S.-I. A unified approach to interpreting model predictions. Adv Neural Inf Process Syst. 2017;30:4765–4774.
60. Pore S., Roy K. Insights into pharmacokinetic properties for exposure chemicals: predictive modelling of human plasma fraction unbound (fu) and hepatocyte intrinsic clearance (Clint) data using machine learning. Digit Discov. 2024;3:1852–1877. doi: 10.1039/D4DD00082J.
61. Moncada-Torres A., van Maaren M.C., Hendriks M.P., Siesling S., Geleijnse G. Explainable machine learning can outperform Cox regression predictions and provide insights in breast cancer survival. Sci Rep. 2021;11:1–13. doi: 10.1038/s41598-021-86327-7.
62. Rodríguez-Pérez R., Bajorath J. Interpretation of machine learning models using Shapley values: application to compound potency and multi-target activity predictions. J Comput Aided Mol Des. 2020;34:1013–1026. doi: 10.1007/s10822-020-00314-0.
63. Merugu S., Jagiello K., Gajewicz-Skretna A., Halappanavar S., Williams A., Vogel U., et al. The impact of carbon nanotube properties on lung pathologies and atherosclerosis through acute inflammation: a new AOP-anchored in silico NAM. Small. 2025;21. doi: 10.1002/smll.202501185.
64. Sifonte E.P., Castro-Smirnov F.A., Jimenez A.A.S., Diez H.R.G., Martínez F.G. Quantum mechanics descriptors in a nano-QSAR model to predict metal oxide nanoparticles toxicity in human keratinous cells. J Nanopart Res. 2021;23:1–16. doi: 10.1007/s11051-021-05288-0.
65. Huang Y., Li X., Xu S., Zheng H., Zhang L., Chen J., et al. Quantitative structure–activity relationship models for predicting inflammatory potential of metal oxide nanoparticles. Environ Health Perspect. 2020;128:1–13. doi: 10.1289/EHP6508.

Associated Data

Supplementary Materials

Supplementary material: mmc1.docx (657.5 KB, docx)

Supplementary material: mmc2.xlsx (48.7 KB, xlsx)

Supplementary material: mmc3.zip (2.6 MB, zip)

Data Availability Statement

Additional data are provided in Supplementary Materials 1 (SM1) and 2 (SM2). The “intelligent Read Across” tool used in this study is available in Supplementary Materials 3 (SM3).
