Abstract
This article aims to provide a comprehensive critical, yet readable, review of general interest to the chemistry community on molecular similarity as applied to chemical informatics and predictive modeling with a special focus on read-across (RA) and read-across structure–activity relationships (RASAR). Molecular similarity-based computational tools, such as quantitative structure–activity relationships (QSARs) and RA, are routinely used to fill the data gaps for a wide range of properties including toxicity endpoints for regulatory purposes. This review will explore the background of RA starting from how structural information has been used through to how other similarity contexts such as physicochemical, absorption, distribution, metabolism, and elimination (ADME) properties, and biological aspects are being characterized. More recent developments of RA’s integration with QSAR have resulted in the emergence of novel models such as ToxRead, generalized read-across (GenRA), and quantitative RASAR (q-RASAR). Conventional QSAR techniques have been excluded from this review except where necessary for context.
Keywords: Molecular similarity, read-across, RASAR, QSAR, predictive toxicology
Introduction
The behavior of chemical compounds expressed as their biological activities, physicochemical properties, and/or toxicities is modulated by their chemistry (Bender and Glen 2004; Maldonado et al. 2006; Maggiora et al. 2014). The presence and arrangement of different chemical functionalities within a molecular structure determine the intramolecular and intermolecular interactions. These govern the type and magnitude of chemical forces playing within and between similar and different types of molecules resulting in differences in physical, chemical, and biological properties. According to Rouvray, “similarity is one of the most instantly recognizable and universally experienced abstracts known to mankind. It is an abstraction that is at once ubiquitous in scope, interdisciplinary in nature, and seemingly boundless in its ramification” (Rouvray 1990). As per the similarity principle, similar compounds (i.e. compounds that are similar in molecular structure) should behave similarly; however, the concepts of similarity paradox and activity cliffs also exist (Nikolova and Jaworska 2003). Although the concept of “similarity” for molecules was originally focused on structural similarity, it is currently applied in a much broader perspective to encompass other contexts such as physicochemical properties, chemical reactivity, ADME (absorption, distribution, metabolism, and elimination) properties, biological similarity, and similarity in toxicological profile.
Chemical descriptors, including molecular fingerprints, are typically relied upon to quantitatively encode chemical structural information (Bender et al. 2009). Such descriptors may also be used as the starting point for calculating different similarity metrics. However, with an extended concept of similarity, biological descriptors arising from high throughput screening (HTS) (e.g. ToxCast) or high content screening (HCS) data, including transcriptomics and metabolomics data and high throughput phenotypic screening data, may also be used as inputs for calculating similarity scores (Mangiatordi et al. 2016; Hemmerich and Ecker 2020). With the application of chemometrics and cheminformatics, molecular similarity may be considered not only based on a qualitative hypothesis but also with a more quantitative assessment of the level of similarity for predictions and data gap filling, which is important in the chemical regulatory context, in addition to medicinal chemistry and materials science.
A quantitative structure–activity/property relationship (QSA(P)R) is a supervised learning-based statistical approach to property predictions, while a relatively simpler and non-statistical approach is read-across (RA) which considers chemical structure and/or property-based similarity among chemicals to fill data gaps notably for regulatory purposes (Muratov et al. 2020). It is worth noting that RA is typically performed as part of an expert-driven assessment which creates challenges in terms of reproducibility and regulatory acceptance. The knowledge of structure–activity relationships (SARs) and the identification of structural alerts (SAs), grouped as profilers, can be used to identify chemicals that share a common molecular initiating event (MIE) within an adverse outcome pathway (AOP), and in doing so define a chemical category to facilitate endpoint RA (Cronin and Richarz 2017; Cronin et al. 2017; Benfenati et al. 2019; Fischer et al. 2020). The RA concept has been extended to biological RA considering the biological effects of chemicals using HTS/HCS data (see references Low et al. 2013; Zhu et al. 2016; Russo et al. 2016 as examples). For regulatory purposes, and notably under the European Union Registration, Evaluation, Authorisation and restriction of Chemicals (REACH) regulation, structural similarity alone is not sufficient to justify a RA, especially for more complex human health effects. Further evidence of biological and toxicokinetic similarity is also required (ECHA 2017). Uncertainty considerations within RA predictions within toxicology have allowed for the development of frameworks to characterize and document aspects of uncertainty. Examples include work by Schultz et al. (2015, 2019), Blackburn and Stuard (2014), and Patlewicz et al. (2015). 
The insights derived from these studies have additionally been taken up in several integrated approaches to testing and assessment (IATA) case studies under the auspices of the OECD work program and captured as part of the OECD template for reporting RA information and will additionally feature as part of the updated OECD Grouping Guidance Document 194 (OECD 2017).
Attempts have also been made to make RA predictions more objective, i.e. to quantify uncertainties and performance, by adding a supervised modeling component with similarity-weighted average predictions. Examples include chemical biological read-across (CBRA) by Low et al. (2013) and generalized read-across (GenRA) by Shah et al. (2016). Quantification of the contributions of the individual features is not obvious in this type of RA given the limited data. Efforts to systematically quantify the contribution of different types of similarities, and their impact on RA of toxicity endpoints, have been made by Shah et al. (2016), Tate et al. (2021), and Helman et al. (2018). While statistical modeling approaches like quantitative structure–activity relationships (QSARs) can shed light on this aspect, these are more appropriate when a sufficient number of data points are available to allow a good degree of (statistical) freedom. When the number of source compounds is limited, statistical model-building approaches become less reliable.
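The similarity-weighted average idea mentioned above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the published CBRA or GenRA implementation: the function name, the choice of k, and the toy similarity and activity values are all hypothetical.

```python
def read_across_prediction(neighbors, k=3):
    """Similarity-weighted average of source-compound activities.

    neighbors: list of (similarity_to_target, activity) pairs (hypothetical data).
    k: number of most similar source compounds to use (an assumption).
    """
    # Keep the k source compounds most similar to the target.
    top = sorted(neighbors, key=lambda pair: pair[0], reverse=True)[:k]
    total_weight = sum(sim for sim, _ in top)
    if total_weight == 0:
        raise ValueError("no source compound has non-zero similarity")
    # Each source activity is weighted by its similarity to the target.
    return sum(sim * act for sim, act in top) / total_weight


# Hypothetical source compounds: (similarity to target, binary toxicity outcome).
sources = [(0.9, 1), (0.6, 0), (0.5, 1), (0.1, 0)]
prediction = read_across_prediction(sources, k=3)
print(round(prediction, 2))
```

With binary outcomes the result can be read as a similarity-weighted fraction of active neighbors; where to place a decision threshold is a separate modeling choice.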
By leveraging RA with QSAR principles, different research groups have predicted physicochemical and environmental properties of chemicals in a regulatory context, for example, GenRA from the Patlewicz group (Shah et al. 2016), OPERA from Mansouri et al. (2018), ToxRead from Benfenati’s group (Gini et al. 2014), etc. Recently, the concepts of RA and QSAR have been merged under a new acronym, read-across structure–activity relationship (RASAR), which applies statistical and machine learning (ML) model building to similarity descriptors (Luechtefeld et al. 2018). Several similarity and error-based metrics have been suggested which may be used together with the typical structural and physicochemical descriptors for the development of QSAR-like models. These models have been shown to have enhanced external predictivity compared to the corresponding QSAR models developed without using additional or higher-dimensional chemical descriptors (Banerjee and Roy 2023a). This approach has been applied along with several ML applications in different examples (Banerjee and Roy 2023b; Roy and Banerjee 2024) for modeling medicinal chemistry (Kumar, Banerjee, et al. 2024), predictive toxicology (Banerjee and Roy 2023a), nanotoxicity (Banerjee, Kar, et al. 2023), and materials property endpoints (Banerjee, Gajewicz-Skretna, et al. 2023). Based on the current authors’ own research and experience, this article presents to chemists and toxicologists the applications of molecular similarity to chemical informatics and predictive modeling with a special focus on RA and RASAR. Because RA and RASAR are data gap filling tools that rely mainly on molecular similarity, as determined from chemical structure information quantified by chemical fingerprints and molecular descriptors, we first discuss the relevant chemical informatics aspects as background.
Chemical informatics
Chemical structure quantification and representation
Graph-based understanding of organic structure was first introduced approximately 150 years ago (Cayley 1874). Subsequent characterization of chemical structure and atomic bonding evolved rapidly with innovations in quantum mechanics, orbital theory, and advancing techniques in microscopy (Weinhold and Landis 2005). For structural quantification and representation, these breakthroughs offer a mathematical means to quantify chemical structure. Generally, the choice of representation depends on several factors, such as the objective of the modeling analysis (for example, whether predictive accuracy or interpretability matters more) and the size of the data set. There are also practical considerations, such as the computational resources available to compute and store the most precise representation, which must be weighed against the competing interest of keeping the calculations simple enough that a meaningful prediction can be made on an acceptable time scale.
At the deepest level, quantum mechanics offers the most precise description of chemical and molecular structures available (McQuarrie 2007). Given so many chemical properties, activities, and behaviors are directly related to their electronic structure, precise solutions to the Schrödinger equations offer the most precise representations with which to attempt any prediction. The clear issue with this choice of quantitation is efficiency. Pragmatically, very few chemicals exist for which this level of theoretical treatment will result in a useful prediction given the limitations of existing computational resources.
Density functional theory (DFT) and its descendent methods of quantum mechanical treatment are quantifications of chemical structures that are used to predict the reactivity of chemicals with known active moieties, such as the DNA-alkylating sites of N-nitrosamines (Kostal and Voutchkova-Kostal 2016, 2023; Wenzel et al. 2022) or toxicant–target interactions (LoPachin et al. 2012). Electronic structure read-across (ESRA) uses the similarity of molecules at the quantum mechanical resolution to infer shared chemical activity (Kostal and Voutchkova-Kostal 2023).
The prohibitive computational complexity of DFT makes it a difficult solution to implement in many workflows (Grimme 2006). Decreasing the precision of the description and quantifying molecules as molecular graphs tends to be the most popular approach. The molecular graph, in which the molecule is described by its constituent atoms and their bonds, offers the opportunity to capture many of the structural elements that generate chemical properties and activity (for instance, acidic functional groups can be predicted from the molecular graph) while requiring significantly fewer computational resources to manipulate or compute. The molecular graph provides a means to compute structural motifs, termed fingerprints, which align modeling efforts with the intuition around predictive functional groups (e.g. the presence of an ester likely indicates a higher chance of being susceptible to hydrolytic degradation in aqueous samples).
We note that the molecular graph can be drawn in either three-dimensional or two-dimensional space and quantified using simple sets of matrices that capture the atomic character and the topological relationships of atoms to one another (Figure 1). One-dimensional descriptors are quantifications of molecules, including constitutional or bulk properties such as molar mass or the logarithm of the water–octanol partition coefficient, that reduce the molecule to a single value. Two-dimensional descriptors include quantities that take the topological characteristics of the molecular graph and numerically capture atomic connectivity or substructural characteristics. Three-dimensional descriptors explicitly depend on the placement of atoms within space and can specify the orientation of chiral centers, define characteristics of the conformational pose, or otherwise capture elements of the molecular geometry. Descriptors may also be continuous (e.g. the logarithm of the partition coefficient, or radius of gyration) or categorical (e.g. the presence of an atom type, or the count of a functional group). The differing levels of treatment involve different levels of computational complexity and molecular detail. In the next section, we touch on the bounty of chemical descriptors (e.g. molecular connectivity indices, kappa shape indices, E-state indices, extended topochemical atom indices, etc.) in modern use that are derived from manipulations of the chemical graph. Here, we mention that an emerging practice in chemical property prediction is to embed molecular graphs directly in deep learning approaches, in which artificial neural networks relate patterns in the molecular graph directly to the prediction. While the exact electronic structure is beyond the reach of this choice of representation, molecular graphs offer a good balance of tractability and robustness of description.
Figure 1. An illustration of differing dimensional treatments of atrazine (CAS: 1912-24-9).
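To make the matrix view of a molecular graph concrete, the following minimal sketch encodes a small molecule as a list of atom labels plus a bond-order adjacency matrix. The molecule (the heavy atoms of acetic acid) and the atom numbering are chosen by hand purely for illustration; hydrogens are omitted.

```python
# Heavy-atom graph of acetic acid, CC(=O)O, written by hand for illustration.
atoms = ["C", "C", "O", "O"]                 # atomic character (node labels)
bonds = [(0, 1, 1), (1, 2, 2), (1, 3, 1)]    # (atom i, atom j, bond order)

n = len(atoms)
adjacency = [[0] * n for _ in range(n)]
for i, j, order in bonds:
    # The graph is undirected, so the matrix is symmetric.
    adjacency[i][j] = order
    adjacency[j][i] = order

# Number of heavy-atom neighbours of each atom.
degrees = [sum(1 for entry in row if entry) for row in adjacency]

for symbol, row in zip(atoms, adjacency):
    print(symbol, row)
print("degrees:", degrees)
```

The atom list carries the atomic character and the matrix carries the topology; two-dimensional descriptors are computed from exactly this kind of information.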
Molecular descriptors and fingerprints are most generally described as precise, prescriptive calculations upon a molecular graph that result in single quantities capturing the physical or topological character of the molecule. At the highest level of chemical structure quantification and representation are chemical ontologies, such as ClassyFire (Feunang et al. 2016). These abstract methods of quantifying chemicals typically use higher-resolution descriptions of molecular structure to produce taxonomic descriptions of their relationships to one another, which can be subsequently used to characterize and explore the relationships of broader ranges of structural space to properties and activities of interest. At this level, precise molecular detail begins to be lost, but higher-level patterns may be more easily identified, and thus these high-level descriptions can find good use in prioritizing chemicals for deeper exploration based on their presence in potential risk categories. It is worth noting that this approach to chemical representation has profound connections to the principles exploited by unsupervised representation learning in deep learning architectures.
Because the representations produced by deep learning are deeply connected to the notion of chemical similarity, we leave the discussion of those representations to the section Molecular descriptors and fingerprints. We mention here that practical options for implementing deep-learned chemical representations are seen in the mol2vec model, and the application of autoencoders to generate latent representations of chemical data (Gómez-Bombarelli et al. 2018; Tro et al. 2019).
Molecular descriptors and fingerprints
Because the computational resources to manipulate or learn from molecular graphs directly have only become widely available relatively recently, the history of computational chemistry is marked by the development of numerical descriptions, or descriptors, that encode molecular structure by implicitly capturing topology, symmetry, or atomic character without the need for a full graphical representation. Most broadly, a “descriptor” can be any description of a molecule ranging from a simple count of a certain element in its molecular formula to the implementation of relatively complex graph theory to approximate the distribution of electronic charge across its molecular surface (Todeschini and Consonni 2009). Since most algorithms for developing predictive models are predicated on a vector representation for input, much of the QSAR literature developed throughout the twentieth century focused on the introduction and demonstration of descriptors and their ability to correlate with empirically measured activities.
Structural keys are binary or count-encoded vectors that embed the presence or absence of substructures within the molecular graph. These can be identified from preexisting libraries, as in the case of Molecular ACCess System (MACCS) keys (Durant et al. 2002) or ToxPrints (Yang et al. 2015), or generated as “hashed” keys that apply algorithmic treatment to derive the set of substructures encoded (Table 1).
Table 1. Structural keys and fingerprints.
| Type | Subtype | Examples | References |
|---|---|---|---|
| Structural key (dictionary fingerprints) | | MACCS (Molecular ACCess System) | Durant et al. (2002) |
| Structural key (dictionary fingerprints) | | PubChem | Kim et al. (2019) |
| Structural key (dictionary fingerprints) | | ToxPrints | Yang et al. (2015) |
| Hashed key | Circular | ECFP | Rogers and Hahn (2010) |
| Hashed key | Circular | MHFP | Probst and Reymond (2018) |
| Hashed key | Topological | Daylight | Daylight Chemical Information Systems Inc. (2024) |
| Hashed key | Topological | RDKit Descriptors | Brown (2015) |
Hashed fingerprints may be generated via topological or neighborhood methods. A topological method, for instance, seen in the popular Daylight fingerprints (Daylight Inc., Laguna Niguel, CA), considers the patterns generated when stepping finite distances across the molecular graph. This exhausts the path patterns within a molecule, generating a unique fingerprint. Figure 2(A) illustrates the calculation of a topological fingerprint, wherein a molecule is decomposed into its unique path patterns and represented as a bit vector. Daylight fingerprints and RDKit fingerprints both use topological approaches to develop their representations of molecular structure.
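The path-enumeration idea can be sketched with a toy implementation. This is a simplified illustration, not the Daylight or RDKit algorithm: it ignores bond orders and hydrogens, the CRC32 hash and the 64-bit length are arbitrary assumptions, and the function name is hypothetical.

```python
import zlib


def path_fingerprint(atoms, bonds, max_bonds=3, n_bits=64):
    """Toy path-based fingerprint: returns (path patterns, set of 'on' bits)."""
    neighbours = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbours[i].append(j)
        neighbours[j].append(i)

    patterns = set()

    def walk(path):
        # Store the atom labels along the path in a direction-independent form.
        labels = [atoms[a] for a in path]
        patterns.add("-".join(min(labels, labels[::-1])))
        # Extend the path by one bond, never revisiting an atom.
        if len(path) - 1 < max_bonds:
            for nxt in neighbours[path[-1]]:
                if nxt not in path:
                    walk(path + [nxt])

    for start in range(len(atoms)):
        walk([start])

    # Hash every pattern to a bit position; distinct patterns may collide.
    bits = {zlib.crc32(p.encode()) % n_bits for p in patterns}
    return patterns, bits


# Heavy atoms of ethanol (CCO), numbered by hand.
patterns, bits = path_fingerprint(["C", "C", "O"], [(0, 1), (1, 2)])
print(sorted(patterns))
print(sorted(bits))
```

Because several patterns can hash to the same position, a set bit cannot always be traced back to a single path, which is one reason hashed fingerprints are harder to interpret than dictionary-based structural keys.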
Figure 2. Illustration of example processes to generate (A) topological fingerprints and (B) circular fingerprints.
Circular fingerprints iteratively consider the neighborhood of the atoms within a molecular graph. Figure 2(B) illustrates this process for connectivities of two layers. Morgan and MHFP fingerprints base their methodology on this process, with the radius of the neighborhoods considered defining differing sets of fingerprints. Circular fingerprints capture the structural details of the molecular topology by denoting patterns in the presence or absence of subgraphs within molecules, and in doing so can explicitly indicate important functional groups.
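The iterative neighborhood update can be illustrated with a toy sketch in the spirit of the Morgan procedure. It is not the ECFP or MHFP algorithm: bond orders, duplicate-environment removal, and the real atom invariants are omitted, and the CRC32 hash, bit length, and function name are assumptions made for illustration.

```python
import zlib


def circular_fingerprint(atoms, bonds, radius=2, n_bits=64):
    """Toy circular fingerprint: returns the set of 'on' bit positions."""
    neighbours = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbours[i].append(j)
        neighbours[j].append(i)

    def stable_hash(text):
        # zlib.crc32 is reproducible across runs, unlike Python's built-in hash().
        return zlib.crc32(text.encode())

    # Layer 0: identify each atom by its element symbol and degree.
    identifiers = [
        stable_hash(f"{atoms[i]}|{len(neighbours[i])}") for i in range(len(atoms))
    ]
    features = set(identifiers)

    for _ in range(radius):
        updated = []
        for i in range(len(atoms)):
            # Combine an atom's identifier with those of its neighbours;
            # sorting makes the result independent of atom numbering.
            environment = sorted(identifiers[j] for j in neighbours[i])
            updated.append(stable_hash(f"{identifiers[i]}|{environment}"))
        identifiers = updated
        features.update(identifiers)

    return {f % n_bits for f in features}


# Ethanol heavy atoms with two different atom numberings.
fp_a = circular_fingerprint(["C", "C", "O"], [(0, 1), (1, 2)])
fp_b = circular_fingerprint(["O", "C", "C"], [(0, 1), (1, 2)])
print(fp_a == fp_b)
```

Each pass widens the neighborhood captured by an atom's identifier by one bond, so `radius=2` corresponds to the two layers described above.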
Fingerprints identify functional groups which may be interpreted according to a chemist’s intuition. A single fingerprint bit typically relates directly to a subgraph within the molecular graph and therefore directly encodes a specific atomic connectivity pattern. This property can be very useful when models need to be interpreted. MAP4 has recently been introduced as a universal fingerprint that can be used to describe and search the chemical space of drugs, biomolecules, and metabolomes (Capecchi et al. 2020).
The abstract nature of some descriptors can make their interpretation or relevance to models quite difficult, which can sometimes present a challenge to modelers seeking to explain their predictions. Information content descriptors (Roy et al. 1984), the electrotopological state indices (Hall and Kier 1995), and molecular identification numbers (Randic 1984) are all chemical descriptors available in much of the current descriptor calculation software, and they leverage information or graph theory to develop complex descriptors of molecular character. While these descriptors frequently emerge as predictive of empirical activities (Lowe et al. 2021), they present a challenge in regulatory contexts where transparency and stakeholder communication can be important. This speaks to a general tradeoff that can emerge in the space of molecular description between descriptive power and ease of intuitive interpretation.
The importance of stakeholder communication is emphasized in the Organisation for Economic Cooperation and Development (OECD) validation principles (OECD 2014). Briefly, these principles put forward ideas relating to what validation is required for a QSAR model to be appropriate for usage within a regulatory context. A recent effort (Lowe et al. 2023) attempted to actively apply and interpret these principles to the ML algorithm context, which involved heavy emphasis on the principle of a mechanistic interpretation. Often these mechanistic interpretations are derived from the use of the descriptors within the algorithm, which relates to the idea that interpretable molecular descriptors can lead to more interpretable models. The OECD QSAR Principles have recently been updated and extended as part of the QSAR assessment framework, which emphasizes the need for the transparency and mechanistic interpretability of descriptors (OECD 2023; Barber et al. 2024).
Chemical descriptors may also incorporate the concept from group contribution models, where atomic contributions to bulk properties such as the van der Waals surface area or water/octanol partition behavior are used to approximate the molecular behavior (Wildman and Crippen 1999; Labute 2000). As these properties frequently possess robust relationships with commonly modeled chemical behaviors such as bioaccumulation or environmental fate, these simple descriptors are not uncommon as powerfully predictive features in modern ML models (Mansouri et al. 2018; Lowe et al. 2023).
Molecular similarity and dissimilarity
The concept of similarity is not new to the chemistry community (Couper 1858; Kekulé 1958); for example, Mendeleev’s periodic table recognized the similarities in properties between groups of elements with related atomic weights (Rouvray 1992). An early application of a clustering method to a chemical database was reported by Harrison (1968). The first reports of similarity searching came from Lederle Laboratories (Wayne, NJ) and Pfizer (New York, NY) in the mid-eighties (Bawden 1988). However, the field of molecular similarity achieved its full status as a legitimate area of chemical research in the late eighties as reviewed by Willett (2016). By the early 1990s, similarity concepts were applied to property prediction, quantum chemistry, ligand–receptor interactions, computer-aided synthesis design, and the modeling of metabolic pathways (Johnson and Maggiora 1990). The interest in molecular diversity started growing with the development of combinatorial chemistry and high-throughput screening techniques in the nineties, and its early applications were described by Willett (1997).
The modern application of molecular similarity is fundamental to the hypotheses motivating QSAR modeling as well as RA. Both of these fields relate to the prediction of chemical and biological behavior from structure and operate from the premise that similar molecules will have similar properties and activities. RA relies on domain experts to judge which chemicals are “similar” for this purpose, while QSA(P)R uses algorithms to automatically (Sheffield and Judson 2019) or semi-automatically (Lowe et al. 2023) define similarity.
This speaks to a critical distinction when discussing chemical similarity or dissimilarity: that of a targeted similarity (supervised) and of an untargeted similarity (unsupervised) approach. Targeted similarity is when a chemical comparison is being made with a specific activity in mind. As an example, N-nitrosamines are chemicals, found as impurities in pharmaceuticals as well as food, that possess strong carcinogenic potential. Their defining functional group, however, is insufficient to precisely predict the potency of their toxic activity, and thus a targeted similarity study requires detailed design to understand what molecular representation and descriptors are required to predict this deleterious behavior (Cross and Ponting 2021). Conversely, untargeted similarity efforts such as ClassyFire strive to build chemical ontologies without a specific behavior in mind (Feunang et al. 2016). Because these ontologies are agnostic to any particular endpoint, they have the advantage of serving a broader range of purposes.
Naturally, the concept of a targeted versus an untargeted similarity has some connections to supervised versus unsupervised ML. A supervised learning task seeks the features of chemical structure that predict the endpoint of relevance, while an unsupervised task only searches for patterns within the input data. Typically, all ML tasks invoke a notion of similarity to relate what they learn from their training data to an external set, which leads to the discussion of latent versus engineered similarity.
When considering chemical similarity, researchers are left to specify, or engineer, how they will quantify chemical similarity within their models. For bitwise or integer representations, the Jaccard similarity (Equation (1)) is one very common choice of a metric that sees utilization in the GenRA and ToxPrint similarity analysis (Helman, Patlewicz, et al. 2019):

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|} \qquad (1)$$

In the context of Equation (1), chemical compounds $A$ and $B$ would typically be sets of features or fingerprints derived from the chemical structures, such as the presence or absence of certain molecular fragments. Euclidean distance (Equation (2)), on the other hand, is a measure of the actual distance between two points in Euclidean space, which is useful for quantifying the (dis)similarity between two chemical compounds represented as points in a multidimensional space (e.g. feature vectors):

$$d(A, B) = \sqrt{\sum_{i=1}^{n} (A_i - B_i)^2} \qquad (2)$$

where $A$ and $B$ are two chemical compounds in an $n$-dimensional space, and $A_i$ and $B_i$ are the $i$th feature values of $A$ and $B$, respectively.
Other options include the angular, or cosine, similarity; the Dice similarity; and the Manhattan distance (Willett 2014). Generally, while more sophisticated constructions of similarity may provide better results, they may come at the cost of interpretability.
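The metrics named above are short enough to write out directly. The sketch below uses only the Python standard library; the feature sets and descriptor vectors are hypothetical, and returning 0.0 for empty inputs is a convention chosen here rather than a universal rule.

```python
import math


def jaccard(a, b):
    """Jaccard similarity of two feature sets (Equation (1))."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def dice(a, b):
    """Dice similarity: twice the shared features over the summed set sizes."""
    a, b = set(a), set(b)
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


def euclidean(x, y):
    """Euclidean distance between two descriptor vectors (Equation (2))."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def manhattan(x, y):
    """Manhattan (city-block) distance: sum of absolute differences."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y))


def cosine(x, y):
    """Cosine (angular) similarity between two descriptor vectors."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm = math.sqrt(sum(xi * xi for xi in x)) * math.sqrt(sum(yi * yi for yi in y))
    return dot / norm if norm else 0.0


# Hypothetical feature sets (indices of substructures present in each compound).
set_a, set_b = {0, 2, 3}, {0, 1, 3}
# Hypothetical descriptor vectors for the same two compounds.
vec_a, vec_b = [1.0, 0.0, 2.0], [1.0, 2.0, 0.0]

print(jaccard(set_a, set_b), dice(set_a, set_b))
print(euclidean(vec_a, vec_b), manhattan(vec_a, vec_b), cosine(vec_a, vec_b))
```

Jaccard and Dice are similarities bounded by 0 and 1, whereas Euclidean and Manhattan are unbounded distances, so descriptor vectors are usually scaled before distances are compared across features.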
This tradeoff brings us to the notion of latent similarity. It is well-characterized within the representation learning literature (Bengio et al. 2013) that the impressive capabilities of deep learning neural network algorithms are derived from their abilities to procedurally manipulate the input representation such that the relationships relevant to the ML task emerge most naturally. Work using latent similarity has been done to produce deep-learned representations of structural space for chemical design (Gómez-Bombarelli et al. 2018) while transfer learning (Cai et al. 2020) uses large libraries of chemical structure to internalize robust representations and similarities before undergoing fine-tuning on the chemical space of interest that may suffer from lacking data.
Deep learning provides a mechanism to simultaneously find both an effective chemical representation and a measure of similarity. However, this both obscures the interpretability of the emerging models and makes them more prone to overemphasizing trends that are specific to their data sets while failing to generalize beyond the data that were available. The definition of molecular similarity is intimately tethered to the choice of chemical representation, and the success of its outcomes depends heavily on whether the study in question has a targeted goal in mind or wishes to formulate a more general, untargeted ontology.
Molecular similarity measures – general considerations
Molecular similarity plays a pivotal role in predictive toxicology employing chemometrics and cheminformatics, underpinning methods like QSAR, pharmacophore modeling, and RA. Comparing molecular structures helps identify compounds with similar properties, which is crucial for predicting chemical behavior, bioactivity, and toxicity.
Methods of structure representation: A chemical structure can be represented in different ways, generally from 1D to 3D, and extended up to 7D, which considers real target-based receptor information (Kumar et al. 2019). A 1D representation of a chemical, such as a simplified molecular input line entry system (SMILES) string, is a textual representation that depicts the structure of a molecule using a string of ASCII characters representing atoms and bonds. In contrast, 2D representations depict molecules as graphs, where atoms are nodes and bonds are edges (Roy et al. 2015a, 2015b). This representation is useful for comparing molecular structures based on their connectivity. The 3D structures represent the three-dimensional shape of a molecule, including bond angles and torsions. This is important because different conformations of the same molecule can have different magnitudes of response. Therefore, the choice of representation opens up different types and classes of molecular similarity measures.
Fingerprint-based similarity: Fingerprint-based similarity stands as a computational method employed in cheminformatics to swiftly compare molecular structures, considered under 2D similarity methods. Molecular fingerprints serve as concise depictions of molecules, transforming intricate structural details into a more manageable form, often represented as binary or count vectors. These vectors function akin to a molecular barcode, where each element or position corresponds to a distinct sub-structural characteristic, encompassing features such as rings, bonds, or functional groups (O’Boyle and Sayle 2016).
The Tanimoto coefficient (Bajusz et al. 2015), also referred to as the Jaccard index, serves as a widely utilized statistical measure in cheminformatics to quantify the similarity between two molecular fingerprints, particularly when they are binary. This coefficient compares the proportion of shared features (bits set to 1) to the total number of features present in at least one of the molecules. To calculate the Tanimoto coefficient, given two binary molecular fingerprints, one divides the number of features (bits) present in both molecules (the intersection) by the number of features present in at least one of the molecules (the union). For example, consider two molecules, A and B, with fingerprints:
Molecule A: [1, 0, 1, 1]; Molecule B: [1, 1, 0, 1]
To calculate the Tanimoto similarity:
Count the number of “1” bits present in both A and B (intersection): 2 bits (the first and the fourth). Count the “1” bits present in either A or B (union): 4 bits (the first, the second, the third, and the fourth). The Tanimoto coefficient ($T$) is then:

$$T = \frac{|A \cap B|}{|A \cup B|} = \frac{2}{4} = 0.5$$

This yields a Tanimoto coefficient of 0.5, signifying a moderate level of similarity between molecules A and B. The coefficient ranges from 0 to 1, where 0 indicates no shared features, and 1 indicates identical fingerprints.
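The bit counting for this example can be checked with a short sketch. For A = [1, 0, 1, 1] and B = [1, 1, 0, 1], two positions are set in both fingerprints and four positions are set in at least one, giving 2/4 = 0.5. The function name is hypothetical, and returning 0.0 when neither fingerprint has any bit set is a convention assumed here.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two equal-length binary fingerprints."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    # Bits set in both fingerprints (intersection).
    both = sum(1 for a, b in zip(fp_a, fp_b) if a and b)
    # Bits set in at least one fingerprint (union).
    either = sum(1 for a, b in zip(fp_a, fp_b) if a or b)
    return both / either if either else 0.0


molecule_a = [1, 0, 1, 1]
molecule_b = [1, 1, 0, 1]
similarity = tanimoto(molecule_a, molecule_b)
print(similarity)
```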
Molecular shape: The 3D structure of a molecule, including its size, orientation, and conformation, is commonly employed in similarity measures. 2D chemical fingerprints are widely used as binary features for the quantification of structural similarity, while 3D information captures more complicated geometric relationships between fingerprint features, including surface area and volumes (Bolcato et al. 2022). In the shape-based methodology, the rapid overlay of chemical structures (ROCS) technique employs the superimposition of three-dimensional structures to assess their shape similarity. This evaluation is vital for comprehending the interaction of molecules with biological systems and determining their toxicological effects (Kearnes and Pande 2016).
Electrostatic properties: Electrostatic properties, including charge distribution and dipole moments, play a crucial role in molecular interactions with biological targets and solvents. These properties are influenced by the molecular structure, leading to regions of positive and negative charges. Electrostatic potential (ESP) maps these charges on the molecule’s surface, highlighting interaction-prone areas. In molecular similarity studies, comparing ESPs can predict toxicological response, as similar ESP patterns suggest similar interactions (Bunin et al. 2006).
Chemical descriptor-based methods: Chemical descriptors quantify molecular characteristics, from simple properties like molecular weight (MW) to complex spatial atom arrangements, enabling computational cheminformatics tasks such as similarity analysis and QSAR modeling (Dong et al. 2015). Molecules are represented as vectors of these descriptors, covering various attributes from physical to topological features. The molecular similarity is determined by comparing these descriptor vectors using metrics like Euclidean and Manhattan distances or cosine similarity. This process occurs in a multidimensional descriptor space, where each dimension represents a different chemical descriptor, offering a comprehensive view of the molecule that includes both basic structural elements and more complex aspects like electronic properties. In this space, similarity is gauged by the proximity of descriptor vectors; a shorter distance indicates higher similarity (Roy et al. 2015a). For instance, with Euclidean distance, similarity inversely relates to the square root of the sum of squared differences between the descriptors of two molecules, providing a quantitative measure of their likeness. Topological indices are numerical values that describe the molecular topology or connectivity patterns within a molecule, disregarding its three-dimensional structure. Indices like the Wiener index, Zagreb indices, and Randic index act as mathematical representations of a molecule’s structure.
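As an illustration of the metrics named above, the following minimal Python sketch compares two descriptor vectors using Euclidean distance, Manhattan distance, and cosine similarity. The descriptor values are hypothetical and are assumed to have been scaled to comparable ranges beforehand, since unscaled descriptors with large numeric ranges (such as MW) would otherwise dominate the distance.

```python
import math


def euclidean(x, y):
    """Square root of the sum of squared descriptor differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """Sum of absolute descriptor differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def cosine_similarity(x, y):
    """Cosine of the angle between two descriptor vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)


# Hypothetical, pre-scaled descriptor vectors for two molecules.
mol_1 = [1.0, 2.0, 3.0]
mol_2 = [4.0, 6.0, 3.0]

print(euclidean(mol_1, mol_2))          # smaller distance = more similar
print(manhattan(mol_1, mol_2))          # smaller distance = more similar
print(cosine_similarity(mol_1, mol_2))  # closer to 1 = more similar
```

Note that the two distances decrease as molecules become more alike, whereas cosine similarity increases, so the direction of each measure has to be kept in mind when ranking neighbors.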
Fragment-based approaches: In cheminformatics, fragment-based approaches involve dissecting molecules into smaller units, referred to as fragments. Similarity is then evaluated based on the presence, absence, or arrangement of these fragments within compounds. This method delves into molecular similarity with a focus on the individual components rather than considering the entire molecule. The rationale behind this strategy lies in the significant role that specific fragments or functional groups play in determining the chemical behavior and toxicity of molecules. Analyzing these fragments allows researchers to pinpoint key structural elements contributing to a molecule’s properties. It also helps identify other compounds that, despite having different overall structures, share crucial functional components, suggesting potentially similar toxicity profiles (Kirsch et al. 2019).
Functional groups and structural motifs: Functional groups, distinct atom configurations such as hydroxyl, carbonyl, and amine groups, dictate a molecule’s chemical reactivity and interactions, which are crucial for drug design and SAR studies. Modifying these groups is a key strategy in drug development to enhance efficacy, minimize toxicity, and improve pharmacokinetics. Moreover, functional groups are predictors of biological activity and toxicological response, with molecules sharing similar groups often displaying analogous activities. Structural alerts are specific structural motifs within molecules that are known to be associated with toxicological effects. The identification of such alerts in a compound’s structure serves as a warning signal for potential toxicity, prompting further investigation (Enoch et al. 2011; Meanwell 2014).
Solubility: Solubility can enhance similarity measures by providing a quantitative aspect of molecular behavior in biological or chemical systems. Molecules with similar solubility profiles, when combined with other structural or physicochemical descriptors, are more likely to exhibit similar biological activities or chemical properties. Solubility is directly influenced by molecular structure, polarity, hydrogen bonding, and other intermolecular forces, making it a significant indicator of molecular interactions and behavior.
- Machine learning approaches:
- Similarity-based prediction models: Prediction models in cheminformatics, like the k-nearest neighbors (k-NN) algorithm, rely on ML to forecast molecular properties by analyzing the similarity within a feature space defined by chemical descriptors or molecular fingerprints (Niazi and Mariam 2023). In k-NN, a molecule’s characteristics are predicted based on those of its “k” closest neighbors, determined by metrics such as the Euclidean distance or Tanimoto coefficient. This method involves identifying the “k” most similar molecules in a database and inferring the target molecule’s properties from theirs, using methods like voting or averaging.
- Deep learning: Deep learning in cheminformatics advances this by employing neural networks, particularly graph-based models like graph convolutional networks (GCNs), to discern complex patterns in molecular data (Nakamura et al. 2022). Molecules are represented as graphs with atoms as nodes and bonds as edges, allowing GCNs to analyze molecular topology and understand how different molecular segments influence overall properties. Techniques like the maximum common substructure (MCS) highlight shared structural motifs across compounds, aiding in molecule classification and design by pinpointing the structural elements responsible for toxicity.
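The k-NN procedure described in the list above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: fingerprints are plain binary lists, similarity is the Tanimoto coefficient, and the helper names (`nearest_neighbors`, `knn_average`, `knn_vote`) and all data are hypothetical.

```python
from collections import Counter


def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two equal-length binary fingerprints."""
    shared = sum(1 for a, b in zip(fp_a, fp_b) if a and b)
    either = sum(1 for a, b in zip(fp_a, fp_b) if a or b)
    return shared / either if either else 0.0


def nearest_neighbors(query_fp, sources, k):
    """Return the k (fingerprint, value) pairs most similar to the query."""
    ranked = sorted(sources, key=lambda s: tanimoto(query_fp, s[0]), reverse=True)
    return ranked[:k]


def knn_average(query_fp, sources, k=3):
    """Continuous endpoint: average the values of the k nearest neighbors."""
    neighbors = nearest_neighbors(query_fp, sources, k)
    return sum(value for _, value in neighbors) / len(neighbors)


def knn_vote(query_fp, sources, k=3):
    """Categorical endpoint: majority vote among the k nearest neighbors."""
    neighbors = nearest_neighbors(query_fp, sources, k)
    return Counter(label for _, label in neighbors).most_common(1)[0][0]


# Hypothetical source compounds: (fingerprint, measured value).
potency_data = [
    ([1, 1, 0, 0, 1], 2.0),
    ([1, 1, 0, 1, 1], 3.0),
    ([0, 0, 1, 1, 0], 8.0),
    ([1, 0, 0, 0, 1], 1.0),
]
query = [1, 1, 0, 0, 1]
print(knn_average(query, potency_data, k=3))
```

The choice of k, the similarity metric, and whether neighbors are weighted by similarity all influence the prediction and would normally be tuned for the endpoint at hand.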
Miscellaneous properties: Compounds’ ability to participate in different types of interactions, for example, hydrogen bonds, van der Waals interactions, ionic interactions, etc., can be a crucial measure of similarity. Beyond structural similarity, the evaluation of molecular properties extends to physicochemical characteristics, toxicological profiles, and metabolic fate, offering a comprehensive understanding of a compound’s behavior in biological systems in terms of biological response and toxicity. Toxicological assessments focus on identifying potential adverse effects and guiding the early screening of compounds for safety concerns. Physicochemical properties, such as solubility (discussed earlier), volatility, and partition coefficient, are crucial for understanding a molecule’s stability, absorption, and distribution within an organism. Integrating these aspects into similarity-based models enhances their predictive power, ensuring a holistic evaluation of chemical safety (Radchenko et al. 2017; Giordano et al. 2022).
Visualization and representation of chemical similarity
In cheminformatics, illustrating the similarities among chemicals entails showcasing how chemical structures relate to and differ from each other in a manner that is both easy to understand and insightful. There are several approaches available for this purpose, each with its advantages contingent on the specific needs of the analysis (Riniker and Landrum 2013; López-Pórez et al. 2023). Here are some of the common methods used for visualizing chemical similarity (Figure 3):
Figure 3.

Composite visualizations of chemical similarity. (A) Similarity table: matrix representation of similarity scores between molecules. (B) Network graph: nodes represent molecules with edges indicating similarity, showing clusters of similar molecules. (C) Chemical space map: 2D projection of molecular similarity using t-SNE. (D) Dendrogram: Hierarchical clustering of molecules. (E) Heatmap: color-coded similarity matrix.
Tables: This is one of the simplest and most conventional representations of similarity measures (Figure 3(A)). Tables offer an organized way to exhibit the similarities and disparities among molecules. Each row and each column represents a molecule, and each cell holds the similarity score for the corresponding pair. This layout is well suited to quantitative studies, facilitating straightforward comparisons across various attributes.
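A pairwise similarity table of this kind can be assembled as follows. This is a minimal Python sketch assuming binary fingerprints and the Tanimoto coefficient; the molecule names and fingerprints are hypothetical, and the same matrix could be passed to a plotting library to draw a heatmap.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two equal-length binary fingerprints."""
    shared = sum(1 for a, b in zip(fp_a, fp_b) if a and b)
    either = sum(1 for a, b in zip(fp_a, fp_b) if a or b)
    return shared / either if either else 0.0


def similarity_matrix(fingerprints):
    """Square matrix whose cell [i][j] is the similarity of molecules i and j."""
    return [[tanimoto(fp_i, fp_j) for fp_j in fingerprints] for fp_i in fingerprints]


def format_table(names, matrix):
    """Render the matrix as plain text with molecule names on rows and columns."""
    header = " " * 8 + "".join(f"{name:>8}" for name in names)
    rows = [
        f"{name:<8}" + "".join(f"{score:8.2f}" for score in row)
        for name, row in zip(names, matrix)
    ]
    return "\n".join([header] + rows)


# Hypothetical molecules and fingerprints.
names = ["mol_1", "mol_2", "mol_3"]
fingerprints = [[1, 1, 0, 0], [1, 1, 1, 0], [0, 0, 1, 1]]

matrix = similarity_matrix(fingerprints)
print(format_table(names, matrix))
```

The matrix is symmetric with ones on the diagonal, so only the upper or lower triangle needs to be inspected.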
Network graphs: In graph-based visualizations, molecules are depicted as nodes and their similarities are illustrated through the edges that connect them (Figure 3(B)). The degree of similarity can be shown by either the thickness of these edges or the proximity between the nodes. Additionally, clustering algorithms can be used to pinpoint groups of molecules that are closely related within this network (Zhang et al. 2021).
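One simple way to build such a network is to connect two molecules whenever their similarity exceeds a cutoff and then read off the connected groups. The Python sketch below assumes a precomputed similarity matrix with hypothetical values and a hypothetical threshold of 0.7; connected components are used here only as a basic stand-in for the clustering algorithms mentioned above.

```python
from collections import deque


def similarity_edges(sim, threshold):
    """List (i, j, score) for every pair whose similarity meets the threshold."""
    n = len(sim)
    return [
        (i, j, sim[i][j])
        for i in range(n)
        for j in range(i + 1, n)
        if sim[i][j] >= threshold
    ]


def connected_components(n_nodes, edges):
    """Group nodes that are linked directly or indirectly by an edge."""
    adjacency = {i: set() for i in range(n_nodes)}
    for i, j, _ in edges:
        adjacency[i].add(j)
        adjacency[j].add(i)
    seen, groups = set(), []
    for start in range(n_nodes):
        if start in seen:
            continue
        seen.add(start)
        queue, group = deque([start]), []
        while queue:  # breadth-first search from the unvisited node
            node = queue.popleft()
            group.append(node)
            for neighbor in sorted(adjacency[node]):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        groups.append(sorted(group))
    return groups


# Hypothetical similarity matrix for four molecules.
sim = [
    [1.0, 0.8, 0.1, 0.1],
    [0.8, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.9],
    [0.1, 0.1, 0.9, 1.0],
]
edges = similarity_edges(sim, threshold=0.7)
print(edges)
print(connected_components(len(sim), edges))
```

The edge scores can be mapped to line thickness when the graph is drawn, and the threshold controls how densely the network is connected.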
Chemical space mapping: Maps of chemical space serve as visual tools where each dot represents a molecule, with the closeness of the dots indicating how similar the molecules are to one another (Figure 3(C)). To convert complex, high-dimensional descriptor spaces into simpler 2D or 3D visualizations, methods such as principal component analysis (PCA), t-distributed stochastic neighbor embedding (t-SNE), and uniform manifold approximation and projection (UMAP) are frequently employed (Gütlein et al. 2012). In most cases, unsupervised dimensionality reduction methods are used, although supervised methods like Arithmetic Residuals in K-Groups Analysis (ARKA) have also been reported (Banerjee and Roy 2024a). The chemical space mapping process begins with the selection of pertinent chemical descriptors, ranging from basic properties like MW and Log P (a measure of lipophilicity) to more intricate descriptors that capture molecular topology or electronic properties. These descriptors act as coordinates, positioning each molecule at a specific point in the multidimensional chemical space (López-Pórez et al. 2023). Because this space typically has many dimensions, it is then reduced. PCA, for example, transforms the original descriptors into a more concise set of new variables, known as principal components, which retain most of the variance in the data. This reduction makes it possible to visualize and interpret patterns within the chemical space, such as the clustering of similar molecules.
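To make the PCA step concrete, the following Python sketch projects a small descriptor matrix onto its first two principal components. It is a bare-bones illustration using power iteration with deflation on the covariance matrix, chosen only to keep the example dependency-free; in practice a numerical library would be used. The descriptor values are hypothetical and assumed to be standardized already, and the function names are illustrative.

```python
import math


def center(rows):
    """Subtract each descriptor's mean so every column has zero mean."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]


def covariance(centered):
    """Sample covariance matrix of mean-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def mat_vec(matrix, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def normalize(vec):
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def top_eigenpair(matrix, iterations=500):
    """Dominant eigenvalue and eigenvector by power iteration."""
    vec = normalize([float(i + 1) for i in range(len(matrix))])
    for _ in range(iterations):
        nxt = mat_vec(matrix, vec)
        if all(abs(v) < 1e-15 for v in nxt):
            break  # no variance left along any direction
        vec = normalize(nxt)
    value = sum(v * w for v, w in zip(vec, mat_vec(matrix, vec)))
    return value, vec


def pca_project(rows, n_components=2):
    """Return projected coordinates and the variance captured per component."""
    centered = center(rows)
    cov = covariance(centered)
    components, variances = [], []
    for _ in range(n_components):
        value, vec = top_eigenpair(cov)
        components.append(vec)
        variances.append(value)
        # Deflation: remove the component just found from the covariance matrix.
        cov = [[cov[i][j] - value * vec[i] * vec[j] for j in range(len(vec))]
               for i in range(len(vec))]
    coords = [[sum(a * b for a, b in zip(row, comp)) for comp in components]
              for row in centered]
    return coords, variances


# Hypothetical standardized descriptors (rows = molecules).
descriptors = [
    [0.9, 1.1, -0.3],
    [1.0, 0.8, -0.1],
    [-1.2, -0.9, 0.4],
    [-0.8, -1.0, 0.2],
    [0.1, 0.0, -0.2],
]
coords, variances = pca_project(descriptors)
for point in coords:
    print(point)
```

Each printed pair is the position of one molecule on the 2D map; molecules with similar descriptor profiles land close together.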
Dendrograms and cluster trees: Dendrograms and cluster trees are tree-shaped charts that illustrate the configuration of clusters created by hierarchical clustering algorithms (Figure 3(D)) (Morlini and Zani 2012). They are especially valuable for categorizing chemicals according to their similarity and for displaying the tiered organization of these categories.
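The merge sequence that a dendrogram draws can be produced by a simple agglomerative procedure. The Python sketch below uses single linkage (the distance between two clusters is the smallest distance between any of their members), which is only one of several possible linkage rules and is an assumption made here for brevity. The distance matrix is hypothetical and could, for example, be derived as 1 minus the Tanimoto similarity.

```python
def single_linkage(dist):
    """Agglomerative clustering; returns the merges in the order they occur.

    Each merge is (members_of_cluster_a, members_of_cluster_b, distance),
    which is the information a dendrogram displays as a joined branch.
    """
    clusters = {i: [i] for i in range(len(dist))}
    merges = []
    while len(clusters) > 1:
        best = None
        keys = sorted(clusters)
        for idx, a in enumerate(keys):
            for b in keys[idx + 1:]:
                # Single linkage: closest pair of members across the two clusters.
                d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        clusters[a] = clusters[a] + clusters.pop(b)
    return merges


# Hypothetical pairwise distances between four molecules.
dist = [
    [0.0, 0.1, 0.8, 0.9],
    [0.1, 0.0, 0.7, 0.85],
    [0.8, 0.7, 0.0, 0.2],
    [0.9, 0.85, 0.2, 0.0],
]
for left, right, height in single_linkage(dist):
    print(left, right, height)
```

The distance at which two branches join gives the height of the corresponding node in the tree, so cutting the tree at a chosen height yields the chemical categories.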
Heatmaps: Heatmaps can serve to depict similarity matrices, with each grid cell representing the similarity score (such as Tanimoto coefficients) between pairs of molecules (Figure 3(E)). The depth of color in each cell reflects the level of similarity, where warmer hues (e.g. red) denote higher similarity, and cooler hues (like blue) signify lower similarity.
Molecular fingerprints and radar plots: For small groups of molecules, molecular fingerprints can be directly visualized to examine the specific features or substructures they have in common. Radar plots display the values of several characteristics or descriptors along multiple axes radiating from a central point, so that molecules with similar profiles produce similar shapes.
Application of the molecular similarity concept in toxicology
The fundamental principle within QSAR, RA, and quantitative read-across structure–activity relationship (q-RASAR) is the notion of molecular similarity, which centers on the idea that molecules with similar structures generally demonstrate comparable properties, such as biological activity. This foundational concept serves as the basis for predicting the activity of chemical compounds by analyzing their structural characteristics. The prediction or identification of the molecular targets of xenobiotics and the association of these targets with perturbed biological pathways involve chemical and biological similarity considerations (Crawford et al. 2017). As the review is dedicated to toxicological aspects, a detailed explanation, complete with applications and examples, is provided below:
Toxicity prediction in pharmaceutical ingredients: The prediction of a new compound’s toxicity leverages molecular similarity by aligning its structure with those of molecules whose toxicity profiles are thoroughly documented. This method hinges on the assumption that structurally akin molecules will likely manifest comparable toxicological behaviors. By examining structural features and toxicological information of established substances, researchers can deduce the likely toxicity of novel compounds that bear a strong structural resemblance. Such an approach facilitates the early detection of toxic effects during the drug development phase, thereby steering the selection of compounds toward those that are safer and more efficacious (Cavasotto and Scardino 2022; Yang and Kar 2023). Molecules that share specific attributes, such as polarity, size, or electronic properties, tend to demonstrate similar behavior within biological systems. This principle becomes particularly evident when examining molecules with comparable Log P values. The Log P value serves as an indicator of a molecule’s hydrophobicity, signifying its inclination to dissolve in hydrophobic (non-polar) environments as opposed to hydrophilic (polar) ones. This resemblance in molecular properties holds fundamental significance in pharmacokinetics, where it is utilized to anticipate how a drug will undergo processes like absorption, distribution, metabolism, excretion, and toxicity (ADMET) within the body (Kar and Leszczynski 2020, 2021). Understanding these molecular similarities empowers scientists to make informed predictions about a molecule’s pharmacokinetic profile, a critical aspect of drug development and therapeutic effectiveness.
Environmental toxicology: Within the realm of environmental chemistry, predictive models play a crucial role in predicting the potential bioaccumulation, persistence and toxicity of environmental pollutants. This assists in evaluating the risks associated with new chemicals before they undergo mass production. Particularly significant in environmental toxicology, these models utilize molecular descriptors to forecast the toxicity of chemical compounds, considering factors such as toxic functional groups and molecular size. Compounds sharing similar descriptors may demonstrate comparable toxicological profiles, providing essential information for regulatory bodies responsible for assessing chemical safety (Kar et al. 2020).
Chemical safety assessment: The evolving landscape of chemical safety assessment is increasingly characterized by the adoption of new approach methodologies (NAMs), which aim to reduce reliance on traditional animal testing methods (Berggren et al. 2015; Lester et al. 2023). These innovative strategies leverage advances in chemical informatics and predictive modeling to expedite risk assessment processes and minimize resource costs. At the current time, there is no agreed definition of NAMs, although most uses of the term include a broad range of in silico models (Westmoreland et al. 2022). With regard to the use of computational models to identify hazard, QSAR and RA are fundamental, with increasing interest in quantitative read-across (q-RA), which may be extended to the more sophisticated q-RASAR models, the latter being significantly enhanced by ML technologies. Strategically applied in chemical safety assessments, in silico models allow a nuanced analysis of potential risks associated with chemical exposure, predicting toxicological effects even in the absence of empirical data. This proves valuable when data are limited due to ethical, time, or financial constraints. The developed mathematical models contribute to informed regulatory decision-making by providing a scientific basis for risk management. Regulatory agencies can prioritize testing, focus on high-risk substances, and support actions such as restrictions, classification, and labeling.
Read-across hypotheses
Chemical and structural similarity in read-across
There are several means of identifying similar compounds for RA: (a) the presence of identical functional groups, which leads to chemical resemblance; (b) shared precursors or breakdown products that are generated through physical or biological degradation, resulting in structurally akin compounds; and (c) consistent patterns or trends within the group regarding their physicochemical or biological properties (ECHA 2013).
Similarities in structural features, physical–chemical properties, reactivity, mechanism, metabolism, and/or activities between the source and target chemicals are used, in part at least, as a basis for justifying RA. Read-across involves using the concept of grouping to predict the physical, chemical, human health, and environmental effects of certain substances based on experimental data from similar reference substances within the same group. This method interpolates findings from known (source) substances to other, less understood substances within the group, often called query or target substances. Source compounds might be identified based on similarity metrics (chemical similarity), as well as based on similarity in SAs, potential metabolic precursors, or chemical classes. The analog approach refers to a specific type of RA used when dealing with one analog, or a small group of substances, that may not exhibit clear trends. In its simplest form, this approach involves applying data from a single source substance directly to predict the properties of a target substance, known as a one-to-one RA. Alternatively, a considerable number of compounds in a group may be used as the source, in which case the term category approach is used (Figure 4). Read-across is used to fill the data gap of a target substance from experimentally available data of chemically and mechanistically similar substances. Some regulatory agencies have begun to endorse these methods, which traditionally rely on the chemical resemblance between the source substance and the compound under investigation (Patlewicz et al. 2024). For example, RA predictions may be accepted by the European Chemicals Agency (ECHA) to provide information in place of animal tests. RA must be, in all cases, justified scientifically and thoroughly documented (Ball et al. 2016).
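The difference between the analog and category approaches can be reduced to a small numerical illustration. The Python sketch below is deliberately simplistic: the one-to-one case copies the source value to the target, and the category case summarizes several source values by their mean. Using the mean is an assumption made here for illustration; other summaries (for example, a trend across the category or the most conservative value) may be more appropriate in practice. All substance names and values are hypothetical.

```python
from statistics import mean


def analog_read_across(source_value):
    """One-to-one RA: the target takes the value of a single source substance."""
    return source_value


def category_read_across(source_values):
    """Category approach (assumed here): average over several source substances."""
    if not source_values:
        raise ValueError("at least one source substance is required")
    return mean(source_values)


# Hypothetical experimental values for an endpoint (arbitrary units).
category = {"source_1": 2.0, "source_2": 2.4, "source_3": 1.6}

print(analog_read_across(category["source_1"]))
print(category_read_across(list(category.values())))
```

In either case the numerical step is trivial; the scientific justification for why the sources are valid surrogates for the target carries the weight of the prediction.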
As per ECHA’s read-across assessment framework (RAAF), one needs to start with the chemical similarity, for which several approaches are available based on molecular descriptors, distance/similarity measures, and weighting schemes for specific endpoints. Therefore, one aspect of this can be to utilize scientifically sound and unambiguous algorithms that distinguish between structurally similar and dissimilar molecules for specific endpoints. However, for successful toxicological predictions, attention must also be paid to kinetics and metabolism aspects. Approaches such as plasma metabolomics in rats may further support the predictions. With the availability of large data sets of SARs, expert tools have been developed that provide a deeper insight into the mechanisms of toxicity (Ball et al. 2020). The RA predictions, which may be qualitative or quantitative, are specific to a particular endpoint, which may be a physicochemical property, environmental fate, human health effect, or ecotoxicity. In practical applications, RA predictions are particularly valuable for complex endpoints like repeated dose toxicity, reproductive and developmental toxicity, and carcinogenicity, as well as for simpler endpoints like acute toxicity, skin sensitization, mutagenicity, and aquatic toxicity. This approach is favored for its potential to conserve resources, reduce costs, save time, and spare animal lives, provided that high-quality data are accessible (Kovarich et al. 2019).
Figure 4.

Representation of the analog and category approaches in read-across predictions.
Using the knowledge of chemical structures and functionalities, one must assess different aspects such as the chemical stability, structures of possible toxic metabolites, similar functional groups, possible routes of exposure and concentrations at the target tissue, metabolism, etc. to make the RA scientifically justified. It is important to ascertain whether the toxicity occurs due to the effect of the compound or its metabolites. For more confident predictions, one should consider the toxicokinetic fate and toxicodynamic behavior of the substance.
Uncertainties of read-across predictions
Employing chemical categories/groups and RA enables the prediction of properties/toxicities of a target chemical using existing data from related source chemicals. While an analog approach, which relies on a single source material, is feasible, having consistent data from several related chemicals enhances the reliability of the predictions (Gellatly and Sewell 2019). During RA predictions, similarity and uncertainty are two important aspects to consider. While chemical substances may be grouped based on molecular structure and chemical properties, these similarities alone are generally not sufficient to make a justification for RA predictions. Thus, in addition to chemical similarity, one needs to further consider bioavailability, metabolism, and biological/mechanistic aspects for toxicological predictions. The uncertainty aspects may arise from similarity justification and also from the completeness of the RA arguments (Schultz et al. 2015). The RA-derived predictions can be accepted after examining the associated uncertainties. For the regulatory use of the predictions, the data for the endpoint being assessed, the RA hypothesis (mechanistic possibilities, supporting evidence, supporting data, and weight-of-evidence), and the similarity justification (physicochemical property, toxicokinetics, and toxicodynamics) should be linked to the RA arguments (Schultz et al. 2019).
The effectiveness of RA is enhanced by a solid biological rationale. Calculating various similarity measures at local/global levels, and understanding the mode of action for toxic responses, boost confidence in predictions. While mechanisms for endpoints like mutagenicity and skin sensitization are well defined, repeated dose toxicity is less straightforward. In such cases, examining ADME properties and other endpoints aids in making reliable predictions. Key steps include identifying appropriate sources with reliable data and creating detailed tables summarizing information on both source and target substances (Mangiatordi et al. 2016). Schultz et al. (2015) introduced templates to assist in evaluating chemical similarity and to assess the uncertainty regarding the mechanistic relevance and the integrity of RA arguments. ECHA recommends the OECD QSAR Toolbox (Dimitrov et al. 2016), an accessible software tool, to facilitate category-based approaches like RA and trend analysis in hazard evaluations. While the QSAR Toolbox enables the identification of structural characteristics and possible action mechanisms of both target and source chemicals, along with applicable experimental data for RA, it requires users to define chemical category criteria and select pertinent experimental data. As a result, the effectiveness of the outcomes may vary with the user’s expertise (ECHA 2014). Therefore, the proper utilization of the QSAR Toolbox and thorough documentation are essential. ECHA has also released the RAAF to ensure a consistent and structured review of submitted grouping and RA arguments (ECHA 2017). To address the most pertinent decision-making scenarios in RA, four distinct scenarios have been outlined (Berggren et al. 2015): (1) compounds that are chemically similar and can cause potential adverse effects on human health in their original form without being metabolized; (2) chemically similar substances that are metabolized, leading to exposure to the same or similar toxicants; (3) chemicals that typically exhibit low or no toxicity; and (4) chemicals within a structurally similar category that display differing toxicities, distinguished by their underlying mechanisms.
Biological similarity and mechanistic aspects in read-across predictions
Traditional RA, a non-testing strategy for bridging data gaps, relies on analog or chemical category approaches. It demands scientifically valid arguments and considers uncertainties through factors commonly used in weight-of-evidence methods. While quantifying chemical similarity is more straightforward, measuring biological similarity lacks a standardized approach. Utilizing diverse data sources like in vitro screenings, omics, and systems biology, integrated with computational models, can enhance RA predictions and elucidate toxicity mechanisms through high-throughput screenings.
Systems biology helps comprehend the properties of complex biological systems, and systems toxicology offers deep insights into toxicological mechanisms, facilitating the understanding of the AOP and advancing chemical risk assessment paradigms. Mathematical models in systems biology connect experimental data to biological outcomes; although AOP and systems biology differ, their synergy could clarify chemical toxicology mechanisms, potentially reducing RA uncertainties (Aguayo-Orozco et al. 2019).
Strengthening traditional RA, which is based on structural similarities, with NAMs, including in vitro molecular screenings, omics assays, and computational models, could achieve regulatory acceptance. Clarifying the mechanism of action for target substances is crucial in justifying RA for toxicity endpoints, with innovative testing methods likely to bolster RA arguments (Gocht et al. 2015; Pestana et al. 2021; Escher et al. 2022).
Quantitative read-across and machine learning applications: interrelationships between QSAR and RA
Read-across started as a qualitative interpolation of the presence of activity from similar molecules or within definable groups/classes of chemicals (Blackburn and Stuard 2014; Schultz et al. 2019). Attempts are now being made to enable greater quantification of RA, allowing potencies to be read across (Gini et al. 2014; Helman et al. 2018; Banerjee, Chatterjee, et al. 2022; Chatterjee et al. 2022; Li et al. 2024). Since RA is a very important area of research in predictive cheminformatics, several research groups have developed their own tools based on the basic theory of RA using a variety of similarity-driven algorithms. GenRA (https://www.epa.gov/comptox-tools/generalized-read-across-genra; last accessed: 23 July 2024) (Shah et al. 2016) is a web-based platform where the query compounds are taken as inputs and various similarity-driven algorithms are employed to find their “n” close source neighbors for the prediction of the query compound. ToxRead (Gini et al. 2014) is a Java-based tool that enables the user to draw the structure of query compounds in the visualization panel, from where various links are connected to different close source compounds, resulting in the toxicity prediction of the query compound. One of the biggest advantages of a Java-based tool is its simple, secure, robust, and scalable nature. Apellis (Varsou and Sarimveis 2021) and Deimos (Varsou and Sarimveis 2023) use different algorithms to generate RA-based predictions. Papadiamantis et al. (2021) describe the development of the Isalos Analytics Platform, powered by Enalos+ nodes, which utilizes the k-NN algorithm to generate RA predictions. The q-RA tool (Chatterjee et al. 2022) offers a Java-based platform that can provide quantitative and qualitative RA predictions using different similarity functions.
Different RA tools rely on their own algorithms for quantifying similarity and for weighting the contributions of close source congeners to a prediction. Their relative suitability may be compared through round-robin exercises in which each tool predicts a particular endpoint for a given set of query chemicals (Benfenati et al. 2016). In the following subsections, a few tools designed for the quantification of RA are described.
GenRA
The development of GenRA was motivated by the desire to transition RA approaches away from expert-driven subjective assessments toward data-driven predictions that would be reproducible and objective. Initial research focused on establishing a baseline in performance in predicting binary in vivo toxicity outcomes as a function of different chemical fingerprints. The in vivo toxicity outcomes encompassed study-level target effects from up to 10 different endpoints including chronic, reproductive, and developmental endpoints that had been captured in the Toxicity Reference Database (ToxRefDB – see Martin and Dix 2009). A similarity-weighted activity from the n nearest neighbors (source analogs) formed the RA prediction. Source analogs were identified based on Jaccard indices, from which the sum of the pairwise similarities multiplied by the toxicity outcomes and divided by the sum of similarities formed the RA prediction. Performance was evaluated using the area under the receiver operating characteristic (ROC) curve. Performance varied across toxicity endpoints and chemicals when aggregated into different clusters. Subsequent research has evolved to quantify the contribution that different similarity contexts play in the in vivo toxicity predictions, i.e. what contribution does similarity in biological profile play relative to structural similarity? Helman et al. (2018) evaluated the contribution that physicochemical properties (namely MW, LogKow (i.e., log P), hydrogen bond acceptor (HBA), and hydrogen bond donor (HBD)) played in the identification of analogs and their associated toxicity predictions. Source analogs were identified in two ways – either analogs were identified based on structural characteristics and then filtered based on their commonality in physicochemical profile or analogs were identified based on structural characteristics and physicochemical similarity at the same time (referred to as search expansion). 
Performance was found to be improved when a search expansion approach was employed, suggesting that physicochemical similarity did play a substantive role in finding analogs and predicting toxicity. The relative weightings of the contribution of physicochemical similarity vs. structural similarity varied across toxicity outcomes and chemicals. That said, it was possible to assign a chemical to a specific structural cluster and then, based on the endpoint of interest, adjust the relative weightings to return a set of source analogs that would be optimized for performance. Ongoing investigations have explored the contribution of targeted transcriptomic data in identifying analogs (Tate et al. 2021) and extending the approach to predict toxicity potencies using LOAEL data from ToxRefDB as well as LD50 values from rat oral toxicity studies (Helman, Patlewicz, et al. 2019; Helman, Shah, Patlewicz, et al. 2019). A more recent focus has been on characterizing ways in which metabolism information can be represented to incorporate a similarity in metabolism component. Given the paucity of metabolism information, efforts have been centered on using predicted metabolism information from a range of different expert tools. Boyce et al. (2022) formed an initial proof-of-concept evaluation of the tools using 37 substances, whereas Groff et al. (2024) extended the evaluation to over 100 substances. Efforts to codify the similarities and evaluate their contribution in RA have been explored in more detail in Patlewicz et al. (2024).
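The similarity-weighted activity that forms the GenRA prediction, as described earlier in this subsection, can be illustrated with a short Python sketch: the n nearest source analogs are ranked by Jaccard index, and the prediction is the sum of similarity multiplied by toxicity outcome, divided by the sum of similarities. The code below is an independent illustration and does not use or reproduce the GenRA software or its Python package; the function names, fingerprints, and outcomes are hypothetical.

```python
def jaccard(fp_a, fp_b):
    """Jaccard index of two equal-length binary fingerprints."""
    shared = sum(1 for a, b in zip(fp_a, fp_b) if a and b)
    either = sum(1 for a, b in zip(fp_a, fp_b) if a or b)
    return shared / either if either else 0.0


def similarity_weighted_activity(target_fp, sources, n=10):
    """Prediction = sum(s_i * a_i) / sum(s_i) over the n nearest source analogs.

    sources: list of (fingerprint, outcome) pairs, where outcome is 1 (toxic)
    or 0 (non-toxic) for a binary endpoint.
    Returns None when no analog shares any feature with the target.
    """
    scored = [(jaccard(target_fp, fp), outcome) for fp, outcome in sources]
    nearest = sorted(scored, key=lambda pair: pair[0], reverse=True)[:n]
    total_similarity = sum(s for s, _ in nearest)
    if total_similarity == 0.0:
        return None
    return sum(s * outcome for s, outcome in nearest) / total_similarity


# Hypothetical target and source analogs with binary in vivo outcomes.
target = [1, 1, 0, 0]
source_analogs = [
    ([1, 1, 0, 0], 1),
    ([1, 1, 1, 0], 0),
    ([1, 0, 0, 0], 1),
    ([0, 0, 1, 1], 0),
]
print(similarity_weighted_activity(target, source_analogs, n=3))
```

For a binary endpoint the result lies between 0 and 1 and can be read as a score to be compared against a decision threshold, which is what makes evaluation by the area under the ROC curve possible.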
At the same time, operationalizing the GenRA approach into a tool was also pursued. The aim here was to align the various steps in a RA workflow (Patlewicz et al. 2018) into a RA tool. Thus, source analogs would first be identified, and a data gap analysis across endpoints and source analogs for the target could be performed. Then, analogs would be evaluated for their consistency and concordance across endpoints and analogs. Finally, a RA prediction would be derived using the similarity-weighted activity formula. As a first iteration, the GenRA approach was implemented as an integral component (see Helman, Shah, Williams, et al. 2019) of the EPA CompTox Chemicals Dashboard (Williams et al. 2017). A target chemical of interest would first be identified using the DSSTox database underpinning the Dashboard. The landing page of that target chemical would provide the necessary links and information to help evaluate the extent of the data gaps. Clicking on GenRA (Helman, Patlewicz, et al. 2019; Helman, Shah, Patlewicz, et al. 2019; Helman, Shah, Williams, et al. 2019) within the landing page would launch the GenRA workflow, where an end-user could navigate through the various steps to arrive at a set of in vivo toxicity predictions and download the report as a CSV or XLSX file. The first iteration of GenRA formed a proof-of-concept implementation of how RA predictions could be performed systematically. The GenRA tool has continued to evolve; the current version, 3.2, has seen several refinements, notably a decoupling from the Dashboard so that it exists as an independent software tool. The specific refinements include the ability to (1) search for analogs based on up to three different fingerprints (so-called hybrid fingerprints), (2) evaluate analogs concerning their predicted physicochemical profile using boxplots, (3) visualize the neighborhood using graph networks, and (4) make predictions of HTS hit call outcomes in addition to in vivo toxicity (binary and potency outcomes).
Version 3.2 also allows for the introduction of chemicals not pre-registered within DSSTox through the use of a chemical drawing palette. The main features as well as how to make predictions are described in Patlewicz and Shah (2023). Version 3.3 due to be released before the end of 2024 offers additional capabilities such as the introduction of physicochemical fingerprints to realize the insights derived from the associated research in Helman et al. (2018). The capability of being able to develop user-specific neighborhoods or category-based predictions will also be a new feature. For more details on GenRA, the reader is referred to the GenRA resource hub at https://www.epa.gov/comptox-tools/genra-resource-hub.
A Python package that forms the engine of the predictions used in the GenRA web application is also available as open-source software. This is described in more detail in Shah et al. (2021) though critically the package enables programmatic access and the ability to make predictions for many chemicals at the same time. It should be noted that preliminary web APIs are available for the GenRA web application itself to facilitate programmatic access.
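The similarity-weighted activity idea that underlies a GenRA-style prediction can be sketched in a few lines. This is a minimal illustration only, not the GenRA package or its API: it assumes binary fingerprints represented as Python sets of "on" bits, Jaccard similarity as the similarity measure, and hypothetical function names (`jaccard`, `similarity_weighted_activity`) chosen for this example.

```python
def jaccard(fp_a, fp_b):
    """Jaccard similarity between two fingerprints given as sets of 'on' bits."""
    union = fp_a | fp_b
    if not union:
        return 0.0
    return len(fp_a & fp_b) / len(union)


def similarity_weighted_activity(target_fp, analogs, k=10):
    """Similarity-weighted mean activity of the k most similar source analogs.

    analogs: list of (fingerprint_set, activity) pairs, where activity may be
    binary (0/1) or continuous. Returns None if no analog shares any bit.
    """
    scored = [(jaccard(target_fp, fp), act) for fp, act in analogs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    nearest = scored[:k]
    total = sum(s for s, _ in nearest)
    if total == 0:
        return None
    return sum(s * a for s, a in nearest) / total


# Toy data (invented for illustration): three source analogs with binary outcomes.
target = {1, 2, 3, 4}
analogs = [
    ({1, 2, 3, 4, 5}, 1),  # Jaccard 4/5
    ({1, 2, 7, 8}, 0),     # Jaccard 2/6
    ({3, 4, 9}, 1),        # Jaccard 2/5
]
score = similarity_weighted_activity(target, analogs, k=3)
```

For a binary endpoint, the resulting score lies between 0 and 1 and would still need a cut-off to be turned into an active/inactive call; the choice of that cut-off is outside this sketch.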
ToxRead and VERA
There are several issues associated with the practical use of RA. An important point is that in many cases RA can provide different results depending on the expert; this is due to the subjectivity of the assessment, so that even starting from the same data and using the same software tool, different results have been obtained (Benfenati et al. 2016). This lack of reproducibility and the subjectivity of the assessment is a serious issue, and it arises partly because, in the initial steps of the RA process, the human expert settles on one plausible hypothesis on which the rest of the assessment is based. This choice eliminates the other options. To overcome this issue, and put all elements of reasoning on the same page, the software ToxRead provides a summary of the different elements of reasoning associated with the target substance for a specific property (Gini et al. 2014). There are tens of modules in ToxRead for the different properties. For each property, the software shows the first N most similar substances linked to the target substance and all the rules/SAs that are associated with the property. For instance, in the case of mutagenicity, ToxRead has more than 800 SAs, and thus it shows those present in the target, plus the N most similar substances containing that SA. In this way, the user obtains a graphical picture of all the most similar substances, simply based on structural similarity, and of those associated with a specific SA. The information on the SA is fundamental, and there are alerts related both to toxicity and to lack of effect. For the structural similarity, ToxRead applies the same algorithm as in VEGA (www.vegahub.eu; last accessed: 23 July 2024). However, compared to VEGA, ToxRead has a larger database and many more SAs. Furthermore, ToxRead is a research tool, and the user can choose several similar substances. 
In the case of continuous properties, such as the bioconcentration factor, ToxRead provides additional information, such as graphs of the property values versus Log P. ToxRead runs one substance at a time and does not work in batch mode. No report is provided. More discussion about ToxRead and examples of how to use it are given here: https://www.life-concertreach.eu/resources-item/e-book/
Exploring RA tools in other directions, other programs have been developed within VEGAHUB, and are also freely available from that website (www.vegahub.eu). For instance, ToxDelta investigates the differences (rather than the similarity) between substances, to identify whether there are causes of a particular effect in the target substance that may not be present in the source one (Golbamaki et al. 2017). Gadaleta et al. used experimental data from bioassays and information on metabolism as components of RA, in addition to the structural information (Gadaleta et al. 2020), and this approach is the basis of the Raxpy software (Viganò et al. 2022).
More recently, VERA has been developed, to cope with the limitations mentioned in the case of ToxRead (Viganò et al. 2022). VERA provides a report of the assessment and can work in batch mode. Furthermore, VERA addresses another issue often associated with the RA and the similarity measure: how to identify a threshold. VERA adopts the criterion of membership, thus it clearly describes the conditions for the similarity. Moreover, to cope with the critical issue of the presence of similar substances with opposite labels, it extends the assessment to a cluster, to increase the robustness. In practical terms, VERA recruits a large number of similar substances (all substances with a similarity >0.65, using the VEGA similarity, which is a low threshold not applied for VEGA). Then, the presence of SAs (for instance for carcinogenicity) is searched. Then, clusters of similar substances containing the SA and the molecular groups present in the target substance are generated. Each cluster is labeled as active or not depending on the prevailing activity of the similar substances in the cluster. The results from the different clusters are finally merged. In the case of continuous values (e.g. fish acute toxicity), local in silico models are generated too.
The quantitative read-across tool
The q-RA (Chatterjee et al. 2022) predictions have been made using three different similarity-based approaches (Euclidean distance-based similarity, Gaussian Kernel-based similarity, and Laplacian Kernel-based similarity). The mathematical forms of these similarity measures are presented in the following equations, respectively.
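$$d_{ED}(X, Y) = \sqrt{\sum_{i=1}^{n} (X_i - Y_i)^2} \tag{4}$$

$$f_{GK}(X, Y) = \exp\left(-\frac{\lVert X - Y \rVert^2}{2\sigma^2}\right) \tag{5}$$

$$f_{LK}(X, Y) = \exp\left(-\gamma \lVert X - Y \rVert_1\right) \tag{6}$$

In the above equations, $d_{ED}$, $f_{GK}$, and $f_{LK}$ are the Euclidean distance, Gaussian Kernel similarity and the Laplacian Kernel similarity, respectively. $X$ and $Y$ are the descriptor vectors (of length $n$) of the two compounds whose distance/similarities are computed, and $\sigma$ and $\gamma$ are the hyperparameters of the Gaussian and Laplacian Kernels, respectively.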
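The three measures named above can be written out as plain functions. The sketch below uses the standard textbook forms of the Euclidean distance, the Gaussian kernel, and the Laplacian kernel; it is an assumption that these match the exact forms and any normalization used in the Read-Across tool, and the function and parameter names (`sigma`, `gamma`) are chosen here for illustration.

```python
import math


def euclidean_distance(x, y):
    """Euclidean distance between two descriptor vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel similarity: exp(-||x - y||^2 / (2 * sigma^2))."""
    return math.exp(-euclidean_distance(x, y) ** 2 / (2 * sigma ** 2))


def laplacian_kernel(x, y, gamma=1.0):
    """Laplacian kernel similarity: exp(-gamma * ||x - y||_1)."""
    l1_norm = sum(abs(a - b) for a, b in zip(x, y))
    return math.exp(-gamma * l1_norm)


# Two hypothetical compounds described by two (already scaled) descriptors.
compound_a = [0.0, 0.0]
compound_b = [3.0, 4.0]
distance = euclidean_distance(compound_a, compound_b)
```

Both kernels return 1 for identical descriptor vectors and decay toward 0 as the vectors move apart, whereas the Euclidean measure is a distance (0 for identical vectors) and has to be converted or thresholded before it can be used as a similarity weight.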
The algorithm associated with the computation of q-RA predictions has been integrated into a simple and user-friendly Java-based software tool, Read-Across-v4.2.1, freely available from https://sites.google.com/jadavpuruniversity.in/dtc-lab-software/home (last accessed: 23 July 2024). This tool takes the source and query set files as input to generate RA predictions for the query compounds. Although it was primarily designed to generate q-RA predictions, this tool can also generate graded RA predictions (in cases where the training set/source compounds have graded response data). The overall prediction quality for q-RA predictions is evaluated based on the standard external validation metrics used in QSAR studies, such as external $Q^2$ metrics, mean absolute error (MAE), and root mean square error of predictions (RMSEP), while metrics like sensitivity, specificity, accuracy, precision, area under the ROC curve (AUC), Matthews correlation coefficient (MCC), Cohen’s kappa, F-measure, and G-means are used for classification-based RA.
This tool was introduced through its application to metal oxide nanoparticles (MeOx NPs) to predict their toxicities (Chatterjee et al. 2022). Three different datasets for the prediction of toxicity imparted by MeOx NPs were collected from the previous literature, namely human keratinocyte cell line (HaCaT) (dataset 1), E. coli (dataset 2), and E. coli in the absence of light (dataset 3) (Gajewicz 2017a, 2017b; Hasse and Klaessig 2018; Santana et al. 2020). All three datasets can be classified as “small datasets” since the number of compounds present in each dataset is very small. The molecular features used for the RA analysis were derived from the previous literature, and no additional features were computed and selected. Each dataset was divided into training (source) and test (query) sets using the random division approach to ensure an unbiased selection. Similarity values, in the form of normalized Euclidean distance, Gaussian Kernel, and Laplacian Kernel, were computed. Based on the similarity values, up to 10 close source neighbors were considered in the RA analysis. A similarity-based weightage was assigned to each query compound based on the data of their close source congeners, and the q-RA predictions were generated taking different values of the number of close source compounds, σ (specific for the Gaussian Kernel-based similarity computation), γ (specific for the Laplacian Kernel-based similarity computation), distance threshold, and similarity threshold. All these parameters associated with different distance and similarity measures can be collectively termed hyperparameters. On comparison of the results, it was observed that the q-RA predictions generated by Roy’s group surpassed the predictive performance of the previously reported nano-QSAR models and the q-RA based on PCA (Gajewicz 2017a, 2017b; Hasse and Klaessig 2018; Santana et al. 2020).
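The role of the hyperparameters listed above can be made concrete with a small similarity-weighted prediction routine. This is a sketch under stated assumptions, not the algorithm of the Read-Across tool: it assumes a Gaussian kernel in its standard form, a weighted mean over the closest source compounds, and a similarity threshold applied as "keep neighbors with similarity at or above the threshold". The names `q_ra_predict`, `k`, `sigma`, and `sim_threshold` are hypothetical.

```python
import math


def gaussian_similarity(x, y, sigma=1.0):
    """Gaussian kernel similarity between two descriptor vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2 * sigma ** 2))


def q_ra_predict(query, sources, k=10, sigma=1.0, sim_threshold=0.0):
    """Similarity-weighted mean response of up to k close source compounds.

    sources: list of (descriptor_vector, response) pairs.
    Returns None when no source compound passes the similarity threshold.
    """
    scored = [(gaussian_similarity(query, x, sigma), y) for x, y in sources]
    scored = [(s, y) for s, y in scored if s >= sim_threshold]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    nearest = scored[:k]
    total = sum(s for s, _ in nearest)
    if total == 0:
        return None
    return sum(s * y for s, y in nearest) / total


# Toy one-descriptor source set (values invented for illustration).
sources = [([0.0], 2.0), ([1.0], 4.0), ([5.0], 10.0)]
prediction = q_ra_predict([0.0], sources, k=2, sigma=1.0)
```

With `k=2` the distant third compound is ignored entirely, and a smaller `sigma` would pull the prediction further toward the response of the single closest neighbor, which is why these settings are tuned rather than fixed.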
This Read-Across-v4.2.1 tool also computes various compound-specific similarity and error-based measures that can be used to estimate the uncertainty in the RA predictions. One may use these similarity and error measures as descriptors in a QSAR modeling framework (vide infra in section “Stepwise formalism of the development of quantitative RASAR models”). Banerjee, Chatterjee, et al. (2022) demonstrated how the prediction quality of a particular query compound can be evaluated based on the Euclidean distance-based approach. They considered the standard deviation of the activity values of the close congeners (SD_Activity), average similarity of the close congeners (Avg. Sim), coefficient of variation of the close congeners (CVsim), and a concordance measure (g) for assessing the compound-specific quality of prediction. The prediction quality of a query compound for which SD_Activity ≤ 0.75, g ≤ 0.4, Avg. Sim ≤ 0.85, and CVsim ≥ 0.05 can be considered as “very good”. The prediction quality of a query compound can be termed “good” when SD_Activity ≤ 0.75 and any one of the remaining three criteria is met. If any one of the above-mentioned criteria is met, the prediction quality can be termed “moderate” whereas if none of the criteria is met, the prediction quality can be termed “bad”.
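The four-level grading described above can be expressed as a small decision function. The thresholds and inequality directions below are copied as printed in the paragraph above and should be checked against Banerjee, Chatterjee, et al. (2022) before any real use; the function name and argument names are hypothetical, and "moderate" is read here as "at least one criterion met without reaching the good level".

```python
def prediction_quality(sd_activity, g, avg_sim, cv_sim):
    """Grade a q-RA prediction using the four criteria as stated in the text."""
    sd_ok = sd_activity <= 0.75
    # Remaining three criteria, with thresholds taken verbatim from the text.
    others = [g <= 0.4, avg_sim <= 0.85, cv_sim >= 0.05]

    if sd_ok and all(others):
        return "very good"
    if sd_ok and any(others):
        return "good"
    if sd_ok or any(others):
        return "moderate"
    return "bad"


# Hypothetical measure values for four query compounds.
grades = [
    prediction_quality(0.5, 0.3, 0.80, 0.10),
    prediction_quality(0.5, 0.9, 0.90, 0.10),
    prediction_quality(1.0, 0.3, 0.90, 0.01),
    prediction_quality(1.0, 0.9, 0.90, 0.01),
]
```

Checking the conditions from strictest to loosest keeps the levels mutually exclusive, so each query compound receives exactly one label.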
A generalized workflow of the q-RA algorithm proposed by Roy’s group is shown in Figure 5.
Figure 5.

Generalized workflow of the q-RA algorithm of Roy’s group (Chatterjee et al. 2022).
Case studies of read-across predictions
Here, we present representative examples of RA predictions in research. Enoch et al. demonstrated the use of a chemistry-based profiler for covalent DNA binding in creating chemical categories for genotoxicity RA (Enoch et al. 2011). Hewitt et al. explored category formation and RA combined with QSAR models and expert systems for predicting developmental toxicity (Hewitt et al. 2010). Skare et al. examined RA and computational prediction methods for assessing the safety of PEG cocamines (Skare et al. 2015). Chavan et al. implemented a k-NN strategy coupled with RA for predicting chronic toxicity based on acute toxicity data (Chavan et al. 2015). Lizarraga et al. (2019) employed an expert-driven RA approach to assess the non-cancer oral toxicity potential of analogs of p,p′-dichlorodiphenyldichloroethane (p,p′-DDD), a known organochlorine pollutant at contaminated sites in the U.S. Paul et al. (2022) performed RA for the prediction of soil ecotoxicity against Folsomia candida. They identified the list of essential features and developed QSAR models, followed by RA predictions. On comparing the external validation statistics of the QSAR and RA, the authors inferred that the RA approach gave enhanced predictivity. Nath et al. (2023) developed computational models and RA-based predictions to estimate the aquatic toxicity of polychlorinated naphthalenes. They also inferred that the RA-based predictions surpassed the corresponding QSAR models in prediction quality. Varsou and Sarimveis (2023) used an automated methodology resulting in optimum grouping by using the tool Deimos. They also found that RA gave improved internal and external validation statistics compared to the primary models, i.e. MLR and LASSO (least absolute shrinkage and selection operator) regression. Chakravarti (2023) adopted RA to assess the carcinogenicity of N-nitrosamines. 
This RA approach has been used to identify the essential structural features responsible for carcinogenicity in N-nitrosamines, and additionally, this work establishes the acceptable intake of nitrosamine impurities. Foster et al. (2022) employed RA using the GenRA tool for assessing structure-based activity in steroid biosynthesis. They employed a multi-strategy approach for the prediction of bioactivity, one element of which was RA. Cronin et al. (2022) employed a mechanistic RA to evaluate SAs responsible for toxicity. In this work, they defined a set of criteria for the uncertainty associated with the SAs responsible for toxicity prediction. Kumar, Kumar, et al. (2024) performed a RA for the chronic toxicity assessment of diverse organic chemicals on Daphnia. They reported an enhanced external prediction quality in the RA approach compared to the conventional QSAR models. In another study, Kumar, Ojha, et al. (2024) performed QSAR and RA to estimate the chronic and sub-chronic toxicities of pesticides against dogs. In this work, too, they reported that the RA predictions had a lower prediction error than the corresponding QSAR models. Apart from these, there are several case studies reporting RA predictions, some of which have been captured, evaluated, and reviewed in different articles (Schultz and Cronin 2017; Pestana et al. 2021; Escher et al. 2022; Patlewicz et al. 2024).
Evolution of read-across structure–activity relationships
Stepwise formalism of the development of quantitative RASAR models
One of the main drawbacks of QSAR modeling is that there should be a sufficient number of data points to generate a reliable model. However, instances may occur where the number of data points is limited and one cannot generate reliable QSAR models due to an insufficient degree of freedom. This issue can be addressed effectively by adopting non-statistical approaches like RA which does not require the development of a mathematical model. Therefore, RA is a useful tool to fill data gaps. Although RA is a promising tool for the prediction of query compounds, it does not provide information on the relative/quantitative contributions of the individual structural and physicochemical features. To mitigate the drawbacks associated with QSAR and RA, a recent chemometric approach has been developed that combines the concepts of QSAR and RA to generate mathematical models using RA-derived similarity and error-based descriptors. This new modeling approach has been termed RASAR (Luechtefeld et al. 2018; Banerjee and Roy 2022). Luechtefeld et al. (2018) demonstrated the first recorded application of the RASAR approach using two similarity-based descriptors in a classification modeling framework while Banerjee and Roy (2022) recorded the first application of the RASAR algorithm in a regression modeling framework (q-RASAR) using a series of similarity and error-based RASAR descriptors derived from RA (as described in Section “Quantitative read-across and machine learning applications: interrelationships between QSAR and RA”). These descriptors can be interpreted based on the similarity data of the close source congeners for a particular query compound. Moreover, in a previous study (Banerjee, Gajewicz-Skretna, et al. 2023) where a QSPR model was generated on a small dataset, it was observed that the corresponding q-RASPR (quantitative RA structure–property relationship) model used a lower number of descriptors to generate more predictive models. 
This increases the statistical significance of the developed model as an ideal model should be developed using a minimal number of descriptors covering the maximum chemical space.
The RASAR descriptors and their significance (Banerjee and Roy 2022, 2023a, 2023c)
One of the prerequisites to generate RASAR/RASPR models is the computation of various similarity and error-based descriptors. Henceforth, we use the term RASAR to include RASPR also. As exemplified in the works of Luechtefeld et al. (2018), the authors considered the similarity values to the closest positive and negative source compounds for each query compound and utilized these two similarity-based descriptors to generate classification models. Later, Banerjee and Roy proposed a total of 18 different RASAR descriptors that can be used for model development (Banerjee and Roy 2022, 2023a, 2023c). All of these descriptors are derived from the similarity and error-based information encoded by the previously selected structural and physicochemical descriptors (Table 2). Banerjee, De, et al. (2022) have shown that a univariate q-RASAR model developed with only RA function has a higher predictivity as compared to the corresponding QSAR model developed using eight descriptors. Like the RA function, the RASAR descriptor is also capable of generating univariate models with enhanced model statistics as compared to the corresponding QSAR models developed using a higher number of descriptors which compromise the degree of freedom (Banerjee, Kar, et al. 2023; Banerjee and Roy 2023c). Together with , AbsDiff serves as an indicator between borderline compounds and definitive positives or negatives. A large AbsDiff value indicates that the query compound has a very close positive source compound and the closest negative compound is located far away in the similarity space and vice versa. An important aspect for QSAR modelers is to efficiently identify the activity cliffs. Such compounds should ideally be removed from the modeling set to generate more robust models. As it is difficult to identify potential activity cliffs before the model development in QSAR, this becomes possible when one computes the RASAR descriptors sm1 and sm2 (Banerjee–Roy similarity coefficients 1 and 2). 
It is important to note that one does not necessarily need to do q-RASAR modeling to identify activity cliffs using sm1 and sm2, as the values of these two descriptors will automatically indicate whether a particular compound is an activity cliff or not. Moreover, these two RASAR descriptors indicate the modelability of a dataset based on the presence of activity cliffs. Another important RASAR descriptor, originally developed to explore classification RASAR modeling (c-RASAR) (Banerjee and Roy 2023c), is gm_class. This is an indicator variable having binary values (0 or 1). When the value of gm is greater than 0, the descriptor gm_class becomes 1. Similarly, when the value of gm is less than 0, gm_class becomes 0. As a general observation, it should be noted that the contribution of most of the RASAR descriptors depends on the data structure of the training set (source compounds). It should also be noted that descriptors like SD_Activity, SE, and CVact should not be taken into consideration for c-RASAR modeling since these descriptors relate to the dispersion of the observed activity values of close congeners, which is only applicable in a regression-based approach (q-RASAR).
Table 2.
List of RASAR descriptors.
| Sl. | RASAR descriptor | Significance |
|---|---|---|
| 1 | RA function | The descriptor RA function encapsulates the information from all the different selected structural and physicochemical descriptors. |
| 2 | SD_activity | This is a measure of the dispersion of the response values of the close source congeners for each query compound, measured in terms of weighted standard deviation. |
| 3 | CV_activity | This is the coefficient of variation of the response values of close source compounds |
| 4 | SE | This is the standard error of the response values of the close source congeners for each query compound |
| 5 | Avg.Sim. | This refers to the average similarity values of the close congeners |
| 6 | Pos.Avg.Sim. | This refers to the average similarity values of the close positive/active congeners |
| 7 | Neg.Avg.Sim. | This refers to the average similarity values of the close negative/inactive congeners |
| 8 | SD_similarity | This is the dispersion in the similarity values among the close source congeners |
| 9 | CVsim | This is the coefficient of variation of the similarity values of close source compounds |
| 10 | MaxPos | This represents the similarity of the query compound to the closest positive/active compound |
| 11 | MaxNeg | This represents the similarity of the query compound to the closest negative/inactive compound |
| 12 | AbsDiff | This is an indicator of the absolute difference between the similarities of the query compound to the closest positive source compound and the closest negative source compound. |
| 13 | gm [Banerjee-Roy concordance coefficient] | This identifies the propensity of a query compound to become positive or negative. |
| 14 | gm* Avg.Sim | The product of gm and Avg.Sim. |
| 15 | gm*SD_Similarity | The product of gm and SD_Similarity |
| 16 | gm_class | This is an indicator variable that further reduces the ambiguity in the estimation of the propensity for a query compound to be positive or negative. |
| 17 | sm1[Banerjee-Roy similarity coefficient 1] | This may be helpful in identifying activity cliffs |
| 18 | sm2 [Banerjee-Roy similarity coefficient 2] | This may be helpful in identifying activity cliffs |
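Several of the simpler descriptors in Table 2 follow directly from their verbal definitions. The sketch below computes Avg.Sim, Pos.Avg.Sim, Neg.Avg.Sim, MaxPos, MaxNeg, AbsDiff, and gm_class for one query compound. It is an illustration and not the RASAR-Desc-Calc implementation: the formula for gm is not given in this section, so gm is taken as an input; returning 0.0 when a query has no positive (or no negative) close congener, and returning None for gm_class when gm equals 0, are assumptions made here.

```python
def basic_rasar_descriptors(neighbors, gm):
    """Compute selected similarity-based descriptors for one query compound.

    neighbors: list of (similarity, is_positive) pairs for the close source
    congeners of the query compound.
    gm: precomputed concordance coefficient (its formula is not reproduced here).
    """
    all_sims = [s for s, _ in neighbors]
    pos_sims = [s for s, positive in neighbors if positive]
    neg_sims = [s for s, positive in neighbors if not positive]

    def mean(values):
        return sum(values) / len(values) if values else 0.0

    max_pos = max(pos_sims) if pos_sims else 0.0
    max_neg = max(neg_sims) if neg_sims else 0.0

    if gm > 0:
        gm_class = 1
    elif gm < 0:
        gm_class = 0
    else:
        gm_class = None  # the text does not define gm_class for gm == 0

    return {
        "Avg.Sim": mean(all_sims),
        "Pos.Avg.Sim": mean(pos_sims),
        "Neg.Avg.Sim": mean(neg_sims),
        "MaxPos": max_pos,
        "MaxNeg": max_neg,
        "AbsDiff": abs(max_pos - max_neg),
        "gm_class": gm_class,
    }


# Three hypothetical close congeners: two positives and one negative.
descriptors = basic_rasar_descriptors(
    [(0.9, True), (0.7, False), (0.5, True)], gm=0.4
)
```

In this toy case MaxPos and MaxNeg are close, so AbsDiff is small, which is the pattern expected for a borderline query compound rather than a definitive positive or negative.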
The algorithm of q-RASAR/c-RASAR modeling
As stated previously, the RASAR approach integrates the concepts of RA within a statistical modeling framework. The computation of similarity-based RASAR descriptors is done only after the efficient selection of the structural and physicochemical descriptors by the process of feature selection – an integral step in the QSAR model development. The selected features are then used to define similarity among compounds, and computation of various similarity and error-based measures is done. While 0–2D QSAR descriptors are computed based on the whole dataset before the dataset division, the RASAR descriptors are computed separately for the training and test sets. A simple Java-based RASAR descriptor calculation software (RASAR-Desc-Calc-v3.0.2) is freely available from https://sites.google.com/jadavpuruniversity.in/dtc-lab-software/home (last accessed: 23 July 2024). It takes the training and query set files as inputs along with certain hyperparameters associated with the three different distance/similarity measures, namely Euclidean distance, Gaussian Kernel similarity, and Laplacian Kernel similarity. The user may want to proceed with the default setting of the hyperparameters (kernel parameters at their default values, number of close training compounds = 10, distance threshold = 1, similarity threshold = 0) or perform optimization of these hyperparameters based on RA using the training set compounds only. This optimization is done by dividing the training set into sub-training and validation sets and performing a system-driven (automatic) or manual grid search operation to identify the set of hyperparameters generating the best RA prediction for the validation set. This optimized set of hyperparameters associated with the best similarity measure is then used to generate RASAR descriptors. 
It should be noted that the tool RASAR-Desc-Calc-v3.0.2 computes RASAR descriptors for the query set compounds (which may be the test set, training set, or even a true external set) based on a standard reference set (training set). The required input files have been shown in Table 3 where examples of training, test, and true external set files have been presented.
Table 3.
Intended scenarios and the required list of input files for the computation of RASAR descriptors.
| Scenario | Training set | Query set |
|---|---|---|
| Computation of the test set RASAR descriptors | Train.xlsx | Test.xlsx |
| Computation of the training set RASAR descriptors | Train.xlsx | Train.xlsx |
| Computation of the true external set RASAR descriptors | Train.xlsx | External.xlsx |
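The hyperparameter optimization described before Table 3 (splitting the training set into sub-training and validation sets, then grid-searching) can be sketched as follows. This is only an illustration of the idea: it assumes a Gaussian-kernel similarity-weighted prediction, uses MAE on the validation set as the selection criterion, and searches over two hyperparameters only. The function names and the toy data are hypothetical, and the actual tool may use other criteria and search spaces.

```python
import itertools
import math


def ra_predict(query, sources, k, sigma):
    """Gaussian-kernel similarity-weighted mean of the k closest source responses."""
    scored = [
        (math.exp(-math.dist(query, x) ** 2 / (2 * sigma ** 2)), y)
        for x, y in sources
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    nearest = scored[:k]
    total = sum(s for s, _ in nearest)
    return sum(s * y for s, y in nearest) / total if total > 0 else None


def grid_search(sub_train, validation, k_values, sigma_values):
    """Return (best_mae, best_k, best_sigma) over the hyperparameter grid."""
    best = None
    for k, sigma in itertools.product(k_values, sigma_values):
        preds = [ra_predict(x, sub_train, k, sigma) for x, _ in validation]
        if any(p is None for p in preds):
            continue  # skip settings that cannot predict every validation compound
        mae = sum(abs(p - y) for p, (_, y) in zip(preds, validation)) / len(validation)
        if best is None or mae < best[0]:
            best = (mae, k, sigma)
    return best


# Toy one-descriptor data (invented): the response equals the descriptor value.
sub_train = [([0.0], 0.0), ([1.0], 1.0), ([2.0], 2.0), ([3.0], 3.0)]
validation = [([0.5], 0.5), ([2.5], 2.5)]
best_mae, best_k, best_sigma = grid_search(
    sub_train, validation, k_values=[1, 2], sigma_values=[0.5, 1.0]
)
```

Because only training compounds enter both the sub-training and validation sets, the test set plays no part in choosing the hyperparameters, which is the point of performing the optimization on the training set only.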
It is essential to note that while computing the RASAR descriptor for the training set (i.e. when both the inputs of the training and query set are the training set itself), there arises a special case when one particular query compound identifies itself among the list of close source neighbors based on similarity considerations. In this case, the tool efficiently identifies and omits the identical training compound present in the list of close source neighbors for a particular query compound. This algorithm has been termed the leave-same-out (LSO) approach, and it helps to eliminate bias and overfitting in the computation of the RASAR descriptors (Banerjee, Kar, et al. 2023).
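The leave-same-out idea can be illustrated with a neighbor search that skips the query compound itself when it is also present in the reference set. This is a minimal sketch, not the tool's code: it assumes compounds are matched by an identifier (the tool may instead detect identical compounds in another way), uses plain Euclidean distance, and the function name `close_neighbors` is hypothetical.

```python
import math


def close_neighbors(query_id, query_x, training, k=10, leave_same_out=True):
    """Return the k closest training compounds as (id, distance) pairs.

    training: dict mapping compound id -> (descriptor_vector, response).
    With leave_same_out=True, a training compound with the same id as the
    query is excluded, so a compound can never be its own neighbor.
    """
    candidates = []
    for compound_id, (x, _) in training.items():
        if leave_same_out and compound_id == query_id:
            continue
        candidates.append((compound_id, math.dist(query_x, x)))
    candidates.sort(key=lambda item: item[1])
    return candidates[:k]


# Toy training set (invented); compound "A" is used as its own query.
training = {
    "A": ([0.0, 0.0], 1.0),
    "B": ([1.0, 0.0], 2.0),
    "C": ([3.0, 0.0], 3.0),
}
with_lso = close_neighbors("A", [0.0, 0.0], training, k=2)
without_lso = close_neighbors("A", [0.0, 0.0], training, k=2, leave_same_out=False)
```

Without the exclusion, the query's own experimental response would sit at distance 0 and dominate any similarity-weighted quantity, which is the source of the bias and overfitting mentioned above.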
Once the RASAR descriptors are computed for both the training and test sets, there is an optional step to combine the RASAR descriptors with the previously selected structural and physicochemical features to obtain a complete descriptor pool (data fusion) (Banerjee and Roy 2023d). Suitable feature selection algorithms can be employed on this resulting pool to develop q-RASAR models, which can be used for the efficient prediction of the test and true external set compounds. However, in one of the previous studies (Banerjee and Roy 2023c), feature selection was done exclusively using the RASAR descriptors, and data fusion was not performed. This led to the generation of models constituting only RASAR descriptors. The need for data fusion has already been explained in one of the previous publications (Banerjee and Roy 2023a); it seems that data fusion is an integral step for q-RASAR modeling, while it can be avoided in the case of developing c-RASAR models that deal with graded response data (Banerjee and Roy 2023c). A generalized workflow for calculating the RASAR descriptors and developing q-RASAR/c-RASAR models is shown in Figure 6. RASAR models have so far been developed (Sun et al. 2023) disregarding the stereochemistry of molecules, which can be handled using 3D and chiral descriptors. If the molecular structure is described using such descriptors, it is also possible to develop RASAR models that deal with stereochemistry.
Figure 6.

Generalized workflow for the development of q-RASAR/c-RASAR models.
Analysis of the applicability domain – detection of prediction confidence outliers using the DTC plot
From a statistical point of view, no model can generate efficient and reliable predictions for all of the compounds present in the dataset. Typically, a mathematical model encodes a certain chemical space represented by the constituting descriptors. A compound that is structurally dissimilar from the training compounds and whose structural properties are not properly reflected by the model descriptors can be termed a structural outlier, and it does not fit into the applicability domain of the model. However, there may be cases where the structure of a compound is properly reflected by the model but due to some other factors, efficient and reliable predictions cannot be obtained. Such compounds can be termed prediction outliers.
In this q-RASAR approach, using the knowledge of RA and the location of the close source congeners, it is possible to identify the prediction confidence outliers from the test set. This is done using the DTC Plot, which is a bubble plot representing three different types of information. The X-axis represents the query compounds arranged according to the decreasing order of their highest similarity values with their closest neighbor. The Y-axis is a measure of the dispersion of the response values of the closest source compounds by representing the absolute difference in the response values of the first two closest source compounds for each query compound. Likewise, the bubble diameter is a measure of the dispersion of the similarity values of the closest source compounds by representing the absolute difference in the similarity values of the first two closest source compounds for each query compound (Figure 7). According to the principle of similarity, two compounds having similar structural and physicochemical properties are expected to possess similar response values. Since RA and RASAR are similarity-derived concepts, the compounds must obey the principle of similarity for the effective application of the RASAR models. The closest source compounds for a particular query compound having a large difference in their similarity values are expected to possess a large difference in their response values as well. Therefore, in the light of the DTC plot, a compound having a larger bubble diameter is expected to be located higher along the Y-axis. Similarly, a query compound having a smaller bubble diameter is expected to be located lower on the Y-axis. Additionally, there may be instances that the value of the similarity to the closest source compound for a particular query compound is very low. This signifies that the particular query compound does not have enough close source congeners, and the impact of the RASAR descriptors for such compounds is low. 
However, there might be cases where the difference in the similarity of the first two closest source neighbors for a particular query compound is relatively high, but the difference in their experimental response values is low. Such cases do not obey the principle of similarity, and thus the corresponding query compound can be considered a prediction confidence outlier. In the DTC plot, such compounds can typically be identified by a bubble diameter that is out of proportion to the bubble's position on the Y-axis. Therefore, in ideal q-RASAR modeling practice, it is essential to identify and remove such prediction confidence outliers from the test set before performing external validation of the model.
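The three quantities shown in a DTC plot can be derived from the two closest source compounds of each query compound. The sketch below computes only the plot coordinates as described above (X rank by decreasing highest similarity, Y as the absolute response difference, bubble size as the absolute similarity difference); it does not draw the plot or apply any outlier rule, and the function name and data layout are hypothetical.

```python
def dtc_plot_points(queries):
    """Compute DTC plot coordinates for each query compound.

    queries: dict mapping query id -> list of (similarity, response) pairs
    for its source compounds (at least two per query).
    Returns a list of dicts ordered along the X-axis.
    """
    points = []
    for query_id, neighbors in queries.items():
        ranked = sorted(neighbors, key=lambda n: n[0], reverse=True)
        (sim_1, resp_1), (sim_2, resp_2) = ranked[0], ranked[1]
        points.append({
            "id": query_id,
            "max_sim": sim_1,               # similarity to the closest neighbor
            "y": abs(resp_1 - resp_2),      # dispersion of response values
            "bubble": abs(sim_1 - sim_2),   # dispersion of similarity values
        })
    # X-axis: queries in decreasing order of their highest similarity value.
    points.sort(key=lambda p: p["max_sim"], reverse=True)
    for rank, point in enumerate(points, start=1):
        point["x"] = rank
    return points


# Two hypothetical query compounds with (similarity, response) of their neighbors.
queries = {
    "q1": [(0.90, 5.0), (0.80, 5.2), (0.30, 9.0)],
    "q2": [(0.95, 4.0), (0.60, 4.1)],
}
points = dtc_plot_points(queries)
```

The resulting list can be passed to any plotting library that supports bubble charts, with the `bubble` value scaled to a marker size.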
Figure 7.

A sample DTC plot representing the prediction confidence outliers.
Case studies of q-RASAR modeling for selected toxicity and ecotoxicity endpoints
The concept of q-RASAR modeling has been applied by different research groups to various toxicity (Banerjee and Roy 2022; Banerjee, De, et al. 2022; Banerjee and Roy 2023c, 2023d, 2023e) and ecotoxicity (Banerjee and Roy 2023a; Banerjee, Kar, et al. 2023; Chatterjee et al. 2023; Chatterjee and Roy 2023; Chen et al. 2023; Ghosh et al. 2023a, 2023b; Pandey and Roy 2023; Sobańska 2023; Yang et al. 2023; Banjare et al. 2024; Das et al. 2024; Gallagher and Kar 2024) endpoints in different case studies, some of which are presented in Appendix 1. Updates on the application of the q-RASAR/c-RASAR approach are available from https://sites.google.com/site/kunalroyindia/home/rasar (last accessed: 23 July 2024).
Research gaps and opportunities to address the limited regulatory acceptance
With the application of chemometrics and cheminformatics, molecular similarity is considered with a quantitative treatment for predictions and data gap filling, which is important in the chemical regulatory context, in addition to medicinal chemistry and materials science. A statistical approach to property predictions is a quantitative structure–activity/property relationship (QSAR/QSPR) while a relatively simpler and non-statistical approach is chemical RA. The concept of RA has not yet been fully utilized for data gap-filling with the quantitative consideration of similarity levels. The regulatory application of RA as per the ECHA RAAF depends on a qualitative or semi-quantitative consideration of chemical and toxicokinetic similarity. There is scope for a quantitative treatment of similarity consideration in regulatory predictions of data. The application of RA in the QSAR Assessment Framework (OECD 2023) is rather new, and this approach has shown some improvement in the quality of predictions. Recently, the concepts of QSAR and RA have been amalgamated to develop a new concept of q-RASAR which uses statistical and ML model-building using error and similarity descriptors to enhance the quality of predictions for external compounds. As both RA and QSAR are accepted methods for data-gap filling in regulatory settings, there is the possibility of the application of hybrid approaches like q-RASAR in regulatory toxicology for enhanced precision in predictions. Further research on the selection of the optimum similarity criteria and associated hyperparameter optimization is warranted. The do’s and don’ts during RASAR model development have recently been reviewed (Banerjee and Roy 2024b).
Overview
Despite advancements, there remains a critical need for further research to refine the methodologies for defining similarity, selecting optimum similarity criteria, and quantifying them appropriately through the hyperparameter optimization needed to support the development of in silico predictive models. Addressing these gaps can significantly improve the regulatory acceptance of cheminformatics tools, thus facilitating more efficient and accurate chemical safety assessments. Supporting RA with greater quantitative assessment would be one step toward gaining greater acceptance and expanding its utility. Moreover, q-RASAR models can improve regulatory decision processes when experimental data are limited, and their applications can be extended across various scientific disciplines, thereby contributing to the advancement of public health and environmental safety.
Acknowledgments
KR thanks Dr. Khac-Minh Thai of the Vietnam National University, Ho Chi Minh City, Vietnam for reading the manuscript and providing useful suggestions. AB and KR thank the Life Science Research Board (LSRB, DRDO), New Delhi for financing a q-RASAR project. The anonymous reviewers selected by the Editor are acknowledged for their insightful comments which have helped to improve the quality of this review.
Abbreviations:
- ADME
absorption, distribution, metabolism, and elimination
- AOP
adverse outcome pathway
- AUC
area under the ROC curve
- CBRA
chemical biological read-across
- DFT
density functional theory
- ECHA
European CHemicals Agency
- ESP
electrostatic potential
- GCN
graph convolutional networks
- GenRA
generalized read-across
- HBA
hydrogen bond acceptor
- HBD
hydrogen bond donor
- HCS
high content screening
- HTS
high throughput screening
- IATA
integrated approaches to testing and assessment
- k-NN
k-nearest neighbors
- LDA
linear discriminant analysis
- LSO
leave-same-out
- MACCS
Molecular ACCess System
- MAE
mean absolute error
- MCC
Matthews correlation coefficient
- MCS
maximum common substructure
- ML
machine learning
- MLR
multiple linear regression
- MW
molecular weight
- NAM
new approach methodologies
- OECD
Organisation for Economic Co-operation and Development
- PCA
principal component analysis
- PLS
partial least squares
- q-RA
quantitative read-across
- q-RASAAR
quantitative read-across structure activity–activity relationship
- q-RASAR
quantitative read-across structure–activity relationship
- QSA(P)R
quantitative structure–activity/property relationship
- RAAF
read-across assessment framework
- RA
read-across
- RASAR
read-across structure–activity relationship
- REACH
Registration, Evaluation, Authorisation and restriction of Chemicals
- ROC
receiver operating characteristic curve
- ROCS
rapid overlay of chemical structures
- SAR
structure–activity relationship
- SMILES
simplified molecular input line entry system
- t-SNE
t-distributed stochastic neighbor embedding
- UMAP
uniform manifold approximation and projection
Appendix
Representative case studies on q-RASAR modeling for different toxicity and ecotoxicity endpoints
Case study 1: application of q-RASAR to predict the binding affinity of endocrine disruptor chemicals (EDCs) to the androgen receptor
This work marks the first application of q-RASAR modeling (Banerjee and Roy 2022). After obtaining the essential structural and physicochemical features, the authors generated read-across predictions. Treating the computed similarity and error-based measures as descriptors, the authors clubbed them with the originally selected QSAR descriptors to obtain a complete descriptor pool. This pool was subjected to feature selection, and various partial least squares (PLS) q-RASAR models were developed. It is essential to note that the external predictivity of the q-RASAR models was higher as compared to the corresponding QSAR and read-across approaches. This paper also introduces the concordance measure gm and demonstrates its importance statistically.
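The workflow described in this case study (similarity- and error-based measures computed from close source compounds, then merged with the selected QSAR descriptors) can be sketched as follows. This is a simplified illustration under stated assumptions: the Gaussian similarity, the four descriptor names (`ra_prediction`, `weighted_sd`, `avg_similarity`, `max_similarity`), and the leave-one-out treatment of training compounds are hypothetical choices made for this sketch and are not the exact definitions used by the authors' tools.

```python
import math

def similarity(x, y, sigma=1.0):
    """Gaussian similarity from squared Euclidean distance (assumed kernel)."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2.0 * sigma ** 2))

def rasar_descriptors(query, sources, responses, k=3):
    """Illustrative similarity- and error-based descriptors for one query."""
    close = sorted(
        ((similarity(query, s), r) for s, r in zip(sources, responses)),
        key=lambda t: t[0],
        reverse=True,
    )[:k]
    sims = [s for s, _ in close]
    total = sum(sims)
    weighted_mean = sum(s * r for s, r in close) / total
    weighted_var = sum(s * (r - weighted_mean) ** 2 for s, r in close) / total
    return {
        "ra_prediction": weighted_mean,         # read-across estimate
        "weighted_sd": math.sqrt(weighted_var),  # spread of neighbor responses
        "avg_similarity": total / len(sims),
        "max_similarity": max(sims),
    }

def fuse(original_descriptors, rasar):
    """Data fusion: append the RASAR descriptors to the QSAR descriptors."""
    return list(original_descriptors) + [rasar[name] for name in sorted(rasar)]

def fused_training_matrix(X, y, k=3):
    """Describe each training compound by its neighbors among the others."""
    rows = []
    for i, x in enumerate(X):
        others, other_y = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        rows.append(fuse(x, rasar_descriptors(x, others, other_y, k)))
    return rows

# Made-up, pre-scaled descriptors and responses (hypothetical data).
X = [[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [1.0, 1.0]]
y = [2.0, 2.2, 5.0, 5.4]
fused = fused_training_matrix(X, y, k=2)
```

The fused matrix would then be passed to feature selection and a regression method such as PLS; those steps are omitted here.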
Case study 2: application of q-RASAR introducing the descriptor RA function
This is an extension of case study 1 where the authors proposed the descriptor RA function and used it to develop a univariate q-RASAR model (Banerjee, De, et al. 2022). Although this model was developed using a single descriptor, it showed greater predictive ability than the corresponding QSAR model developed using eight descriptors. Moreover, univariate models possess high statistical significance, and this univariate model has been developed on a diverse training set, which further increases its statistical significance.
Case study 3: application of q-RASAR in the prediction of cardiotoxicity imparted by pharmaceuticals
This study is an application of the q-RASAR approach in the field of pharmaceuticals (Banerjee and Roy 2023d). The authors worked on a diverse set of pharmaceuticals and drug-like substances to predict their hERG K+ channel inhibitory activity resulting in cardiotoxicity. After performing feature selection of the structural and physicochemical descriptors, the authors computed RASAR descriptors using the default setting of the hyperparameters. Data fusion was performed by clubbing the structural and physicochemical features with the computed RASAR descriptors to generate a complete descriptor pool. Further feature selection of this pool resulted in the generation of the q-RASAR model. Eight additional machine learning (ML) q-RASAR models were also generated. The ML modeling algorithms adopted were ridge regression, linear support vector regression, support vector regression, random forest regression, gradient boosting regression, adaptive boosting regression, multilayer perceptron regression, and k-nearest neighbor regression. A comparison of the PLS q-RASAR model with the corresponding and previously reported QSAR models showed that the q-RASAR model possessed enhanced robustness and predictivity. This study also introduced the DTC Plot for the identification of the prediction confidence outliers.
Case study 4: quantitative prediction of skin sensitization by the application of q-RASAR modeling
This is an application of q-RASAR to model diverse organic chemicals eliciting skin sensitization (Banerjee and Roy 2023e). The purpose of this work was to generate a “global” model that is capable of handling diverse structural information. The list of essential QSAR descriptors was obtained by applying feature selection algorithms. These features were then used to compute the RASAR descriptors using the default setting of the hyperparameters. A complete descriptor pool was generated clubbing the initially selected QSAR descriptors and the RASAR descriptors. This pool was subjected to feature selection to generate a PLS q-RASAR model. Additional ML q-RASAR models were generated to compare the predictive ability. This work also infers that the q-RASAR models generate enhanced external predictivity as compared to the corresponding QSAR model.
Case study 5: qualitative prediction of skin sensitization by the application of c-RASAR modeling
This is an application of classification-based RASAR (c-RASAR) to predict skin sensitization (Banerjee and Roy 2023c). After suitable feature selection, a linear discriminant analysis (LDA) QSAR model was generated consisting of 14 different structural and physicochemical descriptors. These descriptors were used to compute the RASAR descriptors. However, unlike the previous cases, data fusion was not performed, and instead, a suitable feature selection algorithm was employed exclusively on the RASAR descriptors. This resulted in the development of trivariate and univariate models with improved prediction quality, improved robustness, and improved statistical significance (since the 3-descriptor and 1-descriptor c-RASAR models showed better performance than the 14-descriptor QSAR model). Additional ML c-RASAR models were also generated to compare the predictive ability. Through a retrospective analysis, this work also proposed three new RASAR descriptors.
Case study 6: application of q-RASPR to estimate the ecotoxicological effects of pesticide residues present in foods and vegetables
This study reports the prediction of the retention time of organic pesticides in foods and vegetables (Ghosh et al. 2023a). Although this is a property-based endpoint, it is directly correlated to the lipophilicity of the molecules, which imparts toxicity. After suitable feature selection using a genetic algorithm taking MAE as the fitness function, a PLS QSPR model was developed. These descriptors were then used to compute the RASPR descriptors using the default setting of the hyperparameters. A complete descriptor pool was generated from the RASPR descriptors and the initially selected structural and physicochemical descriptors. This pool was subjected to feature selection to generate a q-RASPR model. The q-RASPR model showed greater predictive ability than both the corresponding QSPR model and the previously reported model.
Case study 7: application of q-RASAR to estimate the ecotoxicological effects of pesticide mixtures in honey bees
This study reports the application of q-RASAR to predict the toxicity imparted by pesticide mixtures in honey bees (Chatterjee et al. 2023). The efficient identification of the structural and physicochemical features was done using multiple feature selection algorithms, which was followed by the development of a QSAR model. The RASAR descriptors were computed using the selected QSAR descriptors. Data fusion was performed, and PLS q-RASAR models were generated after suitable feature selection. This was followed by the generation of various other ML q-RASAR models to compare the predictive ability. This work also reports the enhancement in the external prediction quality of the q-RASAR models as compared to the previously reported QSAR models.
Case study 8: first report of the quantitative read-across structure activity–activity relationship (q-RASAAR) – a case study on the estimation of toxicity imparted by antibiotic mixtures in three different bacterial species
This study reports the application of q-RASAAR modeling to predict the toxicity imparted by antibiotic mixtures in three different bacterial species (Vibrio fischeri, Escherichia coli, and Bacillus subtilis) (Chatterjee and Roy 2023). PLS QSAR and QSAAR models were generated after the identification of the essential features by a multilayered feature selection algorithm. The RASAR descriptors were computed after optimization of the hyperparameters based on the training sets. Data fusion was performed (merging the selected QSAR descriptors with the RASAR descriptors), and suitable feature selection algorithms were applied to generate q-RASAAR models. These models possessed the best prediction quality as compared to the QSAR, QSAAR, and the previously reported QSAR models.
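Optimizing the similarity hyperparameters on the training set alone, as described in this case study, can be illustrated with a leave-one-out grid search. The sketch below is an assumption-laden example: the hyperparameters (`k`, the number of close source compounds, and `sigma`, a Gaussian kernel width), the MAE objective, the function names, and the toy data are hypothetical and do not reproduce the procedure of the cited study.

```python
import math
from itertools import product

def similarity(x, y, sigma):
    """Gaussian similarity from squared Euclidean distance (assumed kernel)."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2.0 * sigma ** 2))

def predict(query, sources, responses, k, sigma):
    """Similarity-weighted mean response of the k closest source compounds."""
    close = sorted(
        ((similarity(query, s, sigma), r) for s, r in zip(sources, responses)),
        key=lambda t: t[0],
        reverse=True,
    )[:k]
    total = sum(sim for sim, _ in close)
    if total == 0.0:  # all similarities underflowed: fall back to a plain mean
        return sum(r for _, r in close) / len(close)
    return sum(sim * r for sim, r in close) / total

def loo_mae(X, y, k, sigma):
    """Leave-one-out mean absolute error computed on the training set only."""
    errors = []
    for i in range(len(X)):
        pred = predict(X[i], X[:i] + X[i + 1:], y[:i] + y[i + 1:], k, sigma)
        errors.append(abs(pred - y[i]))
    return sum(errors) / len(errors)

def optimize(X, y, k_grid, sigma_grid):
    """Return the (k, sigma) pair with the lowest leave-one-out MAE."""
    return min(product(k_grid, sigma_grid), key=lambda p: loo_mae(X, y, p[0], p[1]))

# Made-up, pre-scaled training data (hypothetical).
X_train = [[0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.8, 0.9], [0.9, 0.8], [1.0, 1.0]]
y_train = [2.0, 2.2, 2.6, 5.0, 5.1, 5.4]
k_grid = [1, 2, 3]
sigma_grid = [0.25, 0.5, 1.0]
best_k, best_sigma = optimize(X_train, y_train, k_grid, sigma_grid)
best_mae = loo_mae(X_train, y_train, best_k, best_sigma)
```

Because the test set plays no part in the search, the selected setting can afterwards be applied to external compounds without leaking information from them.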
Case study 9: application of q-RASAR to predict the toxicity of polycyclic aromatic hydrocarbons in Pimephales promelas
Representing another application of q-RASAR in predictive toxicology, Chen et al. (2023) worked on the assessment of toxicity of polycyclic aromatic hydrocarbons in Pimephales promelas. They developed multiple linear regression (MLR) models using the ordinary least squares (OLS) method. Read-across-based predictions were also developed using the tool Read-Across-v4.1. The authors also computed the RASAR descriptors and performed data fusion to obtain a descriptor pool. Feature selection was performed on this descriptor pool, which was used to develop MLR models. The authors inferred that while the read-across predictions possessed the highest predictivity, the q-RASAR models showed enhanced robustness as compared to the corresponding QSAR model.
Case study 10: application of q-RASAR in the prediction of toxicity of organic chemicals on Gammarus species
This work demonstrates the application of q-RASAR to predict the toxicity of chemicals on Gammarus species (Yang et al. 2023). After the selection of essential features and generation of an MLR QSAR model, the authors computed the RASAR descriptors and developed a q-RASAR model. They inferred that the statistical performance of the q-RASAR model was similar to that of the corresponding QSAR model.
Case study 11: application of q-RASAR in the assessment of teratogenicity of pesticides during pregnancy
Another example where the q-RASAR algorithm has been successfully applied is the estimation of the teratogenicity of pesticides during pregnancy (Sobańska 2023). The author initially generated a PLS QSAR model and later computed the similarity and error-based RASAR descriptors to develop a q-RASAR model. It was reported that the PLS q-RASAR model possessed enhanced robustness and predictivity as compared to the corresponding QSAR model.
Case study 12: application of q-RASAR in aquatic toxicology – assessment of the toxicity of organic pesticides on different fish species
This work reports the application of q-RASAR in the domain of aquatic toxicology (Ghosh et al. 2023b). Feature selection was performed and PLS models were generated. These features were used to compute the RASAR descriptors, following which, data fusion was performed. The employment of a suitable feature selection algorithm led to the development of PLS q-RASAR models. The authors inferred that the PLS q-RASAR model possessed enhanced predictivity as compared to the PLS QSAR models.
Case study 13: application of c-RASAR to predict mutagenicity
This study reports another application of the c-RASAR approach (Pandey and Roy 2023). An LDA QSAR model was generated after the identification of the essential features. This was followed by the computation and feature selection of the RASAR descriptors, which led to the development of LDA c-RASAR models. Additional ML c-RASAR models were generated to compare the predictive ability. The comparison showed that the c-RASAR model provided enhanced predictivity in comparison to five different expert systems.
Case study 14: application of q-RASAR in nanotoxicology
This work reports the application of q-RASAR to predict the cytotoxicity imparted by hybrid TiO2 nanoparticles (Banerjee, Kar, et al. 2023). Using the same features from the previously reported model by Puzyn’s group (Mikolajczyk et al. 2019), the RASAR descriptors were computed after optimization of the hyperparameters. Data fusion was performed, and a q-RASAR model was developed after feature selection. This was done for another nanotoxicity dataset to generalize the findings. In both cases, simple univariate/bivariate q-RASAR models possessed enhanced predictivity as compared to the previously reported QSAR models.
Case study 15: application of q-RASAR on various toxicity datasets
This study reports the application of q-RASAR on five different toxicity datasets (androgen receptor binding affinity in rats, acute oral toxicity of pesticides against bobwhite quail, acute contact toxicity of plant protection products against honeybees, acute oral toxicity of pesticides against mallard duck, and inhalation toxicity of volatile organic molecules) for which previously reported QSAR models were already available (Banerjee and Roy 2023a). Using the same number of chemical features and the default setting of the hyperparameters, RASAR descriptors were computed. This was followed by data fusion and the development of q-RASAR models after feature selection. In each case, the prediction quality of the developed MLR q-RASAR models surpassed that of the previously reported QSAR models. Various ML models were also generated to compare the predictive ability, and, interestingly, the best predictive quality was demonstrated by linear models (MLR in most cases).
Case study 16: application of q-RASAR for the assessment of toxicity of chemicals to Labeo rohita
This study represents another application of q-RASAR modeling in aquatic toxicology (Gallagher and Kar 2024). Dataset division and feature selection were performed to generate a PLS QSAR model. These selected descriptors were used to compute the RASAR descriptors. A complete data pool was generated merging the selected QSAR descriptors with the RASAR descriptors. This pool was subjected to feature selection to develop an MLR q-RASAR model. The authors concluded that the q-RASAR model generated enhanced external predictivity as compared to the corresponding QSAR model.
Case study 17: application of q-RASAR for the assessment of soil degradation property of veterinary pharmaceuticals
This is another recent application of q-RASAR on an environmental endpoint: the soil degradation property of veterinary pharmaceuticals (Banjare et al. 2024). At first, the modelability of the initial set of compounds was checked using the Banerjee–Roy similarity coefficients (sm1 and sm2) (Banerjee and Roy 2023c). Based on this analysis, the authors excluded two compounds from the dataset because of their poor modelability. The dataset was then divided into training and test sets, and MLR models were generated using the genetic algorithm feature selection technique. This descriptor combination was used to compute RASAR descriptors, which were pooled with the initially selected QSAR descriptors. Finally, the q-RASAR models were developed after suitable feature selection from the data pool. Here too, the authors inferred that although the internal validation statistics of the QSAR and q-RASAR models were similar, the external validation metrics of the q-RASAR models were much higher, signifying enhanced predictivity.
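The distinction between internal and external validation drawn in this case study can be illustrated with two commonly used statistics. The sketch below assumes, for illustration only, a leave-one-out Q² as the internal metric and Q²F1 as the external metric; the cited study may report different or additional metrics, and the function names and numbers are hypothetical.

```python
def q2_loo(y_train, y_loo_pred):
    """Internal metric: 1 - PRESS / total sum of squares around the training mean.

    y_loo_pred holds the prediction for each training compound obtained from a
    model refitted without that compound.
    """
    mean_train = sum(y_train) / len(y_train)
    press = sum((obs - pred) ** 2 for obs, pred in zip(y_train, y_loo_pred))
    ss_total = sum((obs - mean_train) ** 2 for obs in y_train)
    return 1.0 - press / ss_total

def q2_f1(y_test, y_test_pred, y_train):
    """External metric: test-set residuals scaled by deviation from the training mean."""
    mean_train = sum(y_train) / len(y_train)
    ss_res = sum((obs - pred) ** 2 for obs, pred in zip(y_test, y_test_pred))
    ss_ref = sum((obs - mean_train) ** 2 for obs in y_test)
    return 1.0 - ss_res / ss_ref

# Made-up observed and predicted values (hypothetical data).
y_train = [1.0, 2.0, 3.0, 4.0]
y_loo_pred = [1.5, 2.0, 2.5, 4.0]
y_test = [2.0, 3.0]
y_test_pred = [2.5, 3.0]

internal = q2_loo(y_train, y_loo_pred)
external = q2_f1(y_test, y_test_pred, y_train)
```

Two models can share a similar internal statistic and still differ markedly on the external one, which is the pattern the authors report for the QSAR and q-RASAR models.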
Case study 18: application of q-RASAR for the acute toxicity of tire wear particle-derived compounds to Chlorella vulgaris
Rainwater and urban runoff can introduce tire wear particles (TWPs) into aquatic ecosystems, and additives in TWPs can be toxic to aquatic organisms. Jiang et al. (2024) applied QSAR and q-RASAR for modeling the acute toxicity of TWP-derived compounds to Chlorella vulgaris. The results indicate that the q-RASAR models outperformed the QSAR models in both internal and external validation tests. Additionally, the quality of q-RASAR predictions was superior to that of ECOSAR predictions. The authors present the q-RASAR model as an essential tool for screening emerging TWP-derived compounds of concern.
Case study 19: application of q-RASAR for modeling the acute toxicity of pesticides to chickens
Das et al. (2024) used the q-RASAR approach to assess the toxicity of pesticides to chickens. Taking the lowest observed effect level (LOEL) and no observed effect level (NOEL) values as endpoints, the authors computed the RASAR descriptors and used various ML modeling approaches to generate models. They inferred that the PLS q-RASAR model was the best in terms of external predictivity.
Case study 20: application of q-RASAR for modeling the toxicity of diverse chemicals toward T. pyriformis
Ghosh et al. (2024) employed the q-RASAR approach to predict the toxicity of 1792 chemical toxicants (including ketones, aldehydes, organohalogens, isothiocyanic acids, nitriles, amines, amides, esters, and alcohols) on Tetrahymena pyriformis. The validation metrics suggest that the q-RASAR approach generated better internal and external predictivity than the corresponding QSAR model.
Footnotes
Declaration of interest
The GenRA tool was developed by the US EPA, to which GP and NC are affiliated. The manuscript reflects the opinions of GP and NC which are not reflective of the opinions or policies of the US EPA. The ToxRead and VERA tools were developed by EB. AB and KR developed the quantitative Read-Across tool (Read-Across-v4.2.1) and the RASAR descriptor calculator tool (RASAR-Desc-Calc-v3.0.2). This work is supported by Life Science Research Board (LSRB, DRDO), New Delhi (LSRB/01/15001/M/LSRB-394/SH&DD/2022).
References
- Aguayo-Orozco A, Taboureau O, Brunak S. 2019. The use of systems biology in chemical risk assessment. Curr Opin Toxicol. 15:48–54. doi: 10.1016/j.cotox.2019.03.003.
- Bajusz D, Rácz A, Héberger K. 2015. Why is Tanimoto index an appropriate choice for fingerprint-based similarity calculations? J Cheminform. 7(1):20. doi: 10.1186/s13321-015-0069-3.
- Ball N, Cronin MT, Shen J, Blackburn K, Booth ED, Bouhifd M, Donley E, Egnash L, Hastings C, Juberg DR, et al. 2016. Toward Good Read-Across Practice (GRAP) guidance. ALTEX. 33(2):149–166. doi: 10.14573/altex.1601251.
- Ball N, Madden J, Paini A, Mathea M, Palmer AD, Sperber S, Hartung T, van Ravenzwaay B. 2020. Key read across framework components and biology-based improvements. Mutat Res Genet Toxicol Environ Mutagen. 853:503172. doi: 10.1016/j.mrgentox.2020.503172.
- Banerjee A, Chatterjee M, De P, Roy K. 2022. Quantitative predictions from chemical read-across and their confidence measures. Chemom Intell Lab Syst. 227:104613. doi: 10.1016/j.chemolab.2022.104613.
- Banerjee A, De P, Kumar V, Kar S, Roy K. 2022. Quick and efficient quantitative predictions of androgen receptor binding affinity for screening endocrine disruptor chemicals using 2D-QSAR and chemical read-across. Chemosphere. 309(Pt 1):136579. doi: 10.1016/j.chemosphere.2022.136579.
- Banerjee A, Gajewicz-Skretna A, Roy K. 2023. A machine learning q-RASPR approach for efficient predictions of the specific surface area of perovskites. Mol Inform. 42(4):e2200261. doi: 10.1002/minf.202200261.
- Banerjee A, Kar S, Pore S, Roy K. 2023. Efficient predictions of cytotoxicity of TiO2-based multi-component nanoparticles using a machine learning-based q-RASAR approach. Nanotoxicology. 17(1):78–93. doi: 10.1080/17435390.2023.2186280.
- Banerjee A, Roy K. 2022. First report of q-RASAR modeling toward an approach of easy interpretability and efficient transferability. Mol Divers. 26(5):2847–2862. doi: 10.1007/s11030-022-10478-6.
- Banerjee A, Roy K. 2023a. On some novel similarity-based functions used in the ML-based q-RASAR approach for efficient quantitative predictions of selected toxicity end points. Chem Res Toxicol. 36(3):446–464. doi: 10.1021/acs.chemrestox.2c00374.
- Banerjee A, Roy K. 2023b. Read-across and RASAR tools from the DTC laboratory. In: Kar S, Leszczynski J, editors. Current trends in computational modeling for drug discovery. Challenges and advances in computational chemistry and physics. Vol. 35. Cham: Springer.
- Banerjee A, Roy K. 2023c. Prediction-inspired intelligent training for the development of classification read-across structure–activity relationship (c-RASAR) models for organic skin sensitizers: assessment of classification error rate from novel similarity coefficients. Chem Res Toxicol. 36(9):1518–1531. doi: 10.1021/acs.chemrestox.3c00155.
- Banerjee A, Roy K. 2023d. Machine-learning-based similarity meets traditional QSAR: “q-RASAR” for the enhancement of the external predictivity and detection of prediction confidence outliers in an hERG toxicity dataset. Chemom Intell Lab Syst. 237:104829. doi: 10.1016/j.chemolab.2023.104829.
- Banerjee A, Roy K. 2023e. Read-across-based intelligent learning: development of a global q-RASAR model for the efficient quantitative predictions of skin sensitization potential of diverse organic chemicals. Environ Sci Process Impacts. 25(10):1626–1644. doi: 10.1039/D3EM00322A.
- Banerjee A, Roy K. 2024a. ARKA: a framework of dimensionality reduction for machine-learning classification modeling, risk assessment, and data gap-filling of sparse environmental toxicity data. Environ Sci Process Impacts. 26(6):991–1007. doi: 10.1039/D4EM00173G.
- Banerjee A, Roy K. 2024b. How to correctly develop q-RASAR models for predictive cheminformatics. Expert Opin Drug Discov. 2024:1–6. doi: 10.1080/17460441.2024.2376651.
- Banjare P, Singh R, Pandey NK, Matore BW, Murmu A, Singh J, Roy PP. 2024. In silico soil degradation and ecotoxicity analysis of veterinary pharmaceuticals on terrestrial species: first report. Toxicol Res. 13(1):tfae020. doi: 10.1093/toxres/tfae020.
- Barber C, Heghes C, Johnston L. 2024. A framework to support the application of the OECD guidance documents on (Q)SAR model validation and prediction assessment for regulatory decisions. Comput Toxicol. 30:100305. doi: 10.1016/j.comtox.2024.100305.
- Bawden D. 1988. Browsing and clustering of chemical structures. In: Warr WA, editor. Chemical structures. Berlin: Springer Verlag; p. 145–150.
- Bender A, Glen RC. 2004. Molecular similarity: a key technique in molecular informatics. Org Biomol Chem. 2(22):3204–3218. doi: 10.1039/B40981G.
- Bender A, Jenkins JL, Scheiber J, Sukuru SCK, Glick M, Davies JW. 2009. How similar are similarity searching methods? A principal component analysis of molecular descriptor space. J Chem Inf Model. 49(1):108–119. doi: 10.1021/ci800249s.
- Benfenati E, Belli M, Borges T, Casimiro E, Cester J, Fernandez A, Gini G, Honma M, Kinzl M, Knauf R, et al. 2016. Results of a round-robin exercise on read-across. SAR QSAR Environ Res. 27(5):371–384. doi: 10.1080/1062936X.2016.1178171.
- Benfenati E, Chaudhry Q, Gini G, Dorne JL. 2019. Integrating in silico models and read-across methods for predicting toxicity of chemicals: a step-wise strategy. Environ Int. 131:105060. doi: 10.1016/j.envint.2019.105060.
- Bengio Y, Courville A, Vincent P. 2013. Representation learning: a review and new perspectives. IEEE Trans Pattern Anal Mach Intell. 35(8):1798–1828. doi: 10.1109/TPAMI.2013.50.
- Berggren E, Amcoff P, Benigni R, Blackburn K, Carney E, Cronin M, Deluyker H, Gautier F, Judson RS, Kass GEN, et al. 2015. Chemical safety assessment using read-across: assessing the use of novel testing methods to strengthen the evidence base for decision making. Environ Health Perspect. 123(12):1232–1240. doi: 10.1289/ehp.1409342.
- Blackburn K, Stuard SB. 2014. A framework to facilitate consistent characterization of read across uncertainty. Regul Toxicol Pharmacol. 68(3):353–362. doi: 10.1016/j.yrtph.2014.01.004.
- Bolcato G, Heid E, Boström J. 2022. On the value of using 3D shape and electrostatic similarities in deep generative methods. J Chem Inf Model. 62(6):1388–1398. doi: 10.1021/acs.jcim.1c01535.
- Boyce M, Meyer B, Grulke C, Lizarraga L, Patlewicz G. 2022. Comparing the performance and coverage of selected in silico (liver) metabolism tools relative to reported studies in the literature to inform analog selection in read-across: a case study. Comput Toxicol. 21:1–15. doi: 10.1016/j.comtox.2021.100208.
- Brown N. 2015. Appendix D: RDKit. In silico medicinal chemistry: computational methods to support drug design. Cambridge: Royal Society of Chemistry; p. 199–200.
- Bunin BA, Siesel B, Morales GA, Barath J. 2006. Chemoinformatics: theory, practice, & products. Dordrecht: Springer.
- Cai C, Wang S, Xu Y, Zhang W, Tang K, Ouyang Q, Lai L, Pei J. 2020. Transfer learning for drug discovery. J Med Chem. 63(16):8683–8694. doi: 10.1021/acs.jmedchem.9b02147.
- Capecchi A, Probst D, Reymond J-L. 2020. One molecular fingerprint to rule them all: drugs, biomolecules, and the metabolome. J Cheminform. 12(1):43. doi: 10.1186/s13321-020-00445-4.
- Cavasotto CN, Scardino V. 2022. Machine learning toxicity prediction: latest advances by toxicity end point. ACS Omega. 7(51):47536–47546. doi: 10.1021/acsomega.2c05693.
- Cayley A. 1874. On the mathematical theory of isomers. Phil Mag. 47(314):444–447. doi: 10.1080/14786447408641058.
- Chakravarti S. 2023. Computational prediction of metabolic α-carbon hydroxylation potential of N-nitrosamines: overcoming data limitations for carcinogenicity assessment. Chem Res Toxicol. 36(6):959–970. doi: 10.1021/acs.chemrestox.3c00083.
- Chatterjee M, Banerjee A, De P, Gajewicz-Skretna A, Roy K. 2022. A novel quantitative read-across tool designed purposefully to fill the existing gaps in nanosafety data. Environ Sci Nano. 9(1):189–203. doi: 10.1039/D1EN00725D.
- Chatterjee M, Banerjee A, Tosi S, Carnesecchi E, Benfenati E, Roy K. 2023. Machine learning-based q-RASAR modeling to predict acute contact toxicity of binary organic pesticide mixtures in honey bees. J Hazard Mater. 460:132358. doi: 10.1016/j.jhazmat.2023.132358.
- Chatterjee M, Roy K. 2023. “Data fusion” quantitative read-across structure–activity–activity relationships (q-RASAARs) for the prediction of toxicities of binary and ternary antibiotic mixtures toward three bacterial species. J Hazard Mater. 459:132129. doi: 10.1016/j.jhazmat.2023.132129.
- Chavan S, Friedman R, Nicholls IA. 2015. Acute toxicity-supported chronic toxicity prediction: a k-nearest neighbor coupled read-across strategy. Int J Mol Sci. 16(5):11659–11677. doi: 10.3390/ijms160511659.
- Chen S, Sun G, Fan T, Li F, Xu Y, Zhang N, Zhao L, Zhong R. 2023. Ecotoxicological QSAR study of fused/non-fused polycyclic aromatic hydrocarbons (FNFPAHs): assessment and priority ranking of the acute toxicity to Pimephales promelas by QSAR and consensus modeling methods. Sci Total Environ. 876:162736. doi: 10.1016/j.scitotenv.2023.162736.
- Couper A. 1858. On a new chemical theory. Philos Mag. 16:104–116. doi: 10.1080/14786445808642541.
- Crawford SE, Hartung T, Hollert H, Mathes B, van Ravenzwaay B, Steger-Hartmann T, Studer C, Krug HF. 2017. Green toxicology: a strategy for sustainable chemical and material development. Environ Sci Eur. 29(1): 16. doi: 10.1186/s12302-017-0115-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Cronin MTD, Bauer FJ, Bonnell M, Campos B, Ebbrell DJ, Firman JW, Gutsell SJ, Hodges G, Patlewicz G, Sapounidou M, et al. 2022. A scheme to evaluate structural alerts to predict toxicity – assessing confidence by characterising uncertainties. Regul Toxicol Pharmacol. 135:105249. doi: 10.1016/j.yrtph.2022.105249. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Cronin MTD, Enoch SJ, Mellor CL, Przybylak KR, Richarz A-N, Madden JC. 2017. In silico prediction of organ level toxicity: linking chemistry to adverse effects. Toxicol Res. 33(3):173–182. doi: 10.5487/TR.2017.33.3.173. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Cronin MTD, Richarz A-N. 2017. Relationship between adverse outcome pathways and chemistry-based in silico models to predict toxicity. Applied In Vitro Toxicol. 3(4):286–297. doi: 10.1089/aivt.2017.0021. [DOI] [Google Scholar]
- Cross KP, Ponting DJ. 2021. Developing structure–activity relationships for N-nitrosamine activity. Comput Toxicol. 20:100186. doi: 10.1016/j.comtox.2021.100186. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Das S, Samal A, Ojha PK. 2024. Chemometrics-driven prediction and prioritization of diverse pesticides on chickens for addressing hazardous effects on public health. J Hazard Mater. 471:134326. doi: 10.1016/j.jhazmat.2024.134326. [DOI] [PubMed] [Google Scholar]
- Daylight Chemical Information Systems Inc. 2024. [accessed 2024 Jul 24]. https://www.daylight.com.
- Dimitrov SD, Diderich R, Sobanski T, Pavlov TS, Chankov GV, Chapkanov AS, Karakolev YH, Temelkov SG, Vasilev RA, Gerova KD, et al. 2016. QSAR toolbox – workflow and major functionalities. SAR QSAR Environ Res. 27(3):203–219. doi: 10.1080/1062936X.2015.1136680.
- Dong J, Cao D-S, Miao H-Y, Liu S, Deng B-C, Yun Y-H, Wang N-N, Lu A-P, Zeng W-B, Chen AF, et al. 2015. ChemDes: an integrated web-based platform for molecular descriptor and fingerprint computation. J Cheminform. 7(1):60. doi: 10.1186/s13321-015-0109-z.
- Durant JL, Leland BA, Henry DR, Nourse JG. 2002. Reoptimization of MDL keys for use in drug discovery. J Chem Inf Comput Sci. 42(6):1273–1280. doi: 10.1021/ci010132r.
- ECHA. 2013. Grouping of substances and read-across approach. Helsinki: ECHA-13-R-02-EN.
- ECHA. 2014. Illustrative example with the OECD QSAR Toolbox workflow – Part 1: introductory note. https://echa.europa.eu/documents/10162/1135266/illustrative_example_qsar_part2b_en.pdf/fdd5f115-faee-45ec-ac95-037b84c72ec0.
- ECHA. 2017. Read-across assessment framework (RAAF). https://echa.europa.eu/documents/10162/13628/raaf_en.pdf/614e5d61-891d-4154-8a47-87efebd1851a.
- Enoch SJ, Cronin MTD, Ellison CM. 2011. The use of a chemistry-based profiler for covalent DNA binding in the development of chemical categories for read-across for genotoxicity. Altern Lab Anim. 39(2):131–145. doi: 10.1177/026119291103900206.
- Escher SE, Aguayo-Orozco A, Benfenati E, Bitsch A, Braunbeck T, Brotzmann K, Bois F, van der Burg B, Castel J, Exner T, et al. 2022. Integrate mechanistic evidence from new approach methodologies (NAMs) into a read-across assessment to characterise trends in shared mode of action. Toxicol In Vitro. 79:105269. doi: 10.1016/j.tiv.2021.105269.
- Feunang YD, Eisner R, Knox C, Chepelev L, Hastings J, Owen G, Fahy E, Steinbeck C, Subramanian S, Bolton E, et al. 2016. ClassyFire: automated chemical classification with a comprehensive, computable taxonomy. J Cheminform. 8(1):61. doi: 10.1186/s13321-016-0174-y.
- Fischer I, Milton C, Wallace H. 2020. Toxicity testing is evolving! Toxicol Res. 9(2):67–80. doi: 10.1093/toxres/tfaa011.
- Foster MJ, Patlewicz G, Shah I, Haggard DE, Judson RS, Friedman KP. 2022. Evaluating structure-based activity in a high-throughput assay for steroid biosynthesis. Comput Toxicol. 24:1–23. doi: 10.1016/j.comtox.2022.100245.
- Gadaleta D, Golbamaki Bakhtyari A, Lavado GJ, Roncaglioni A, Benfenati E. 2020. Automated integration of structural, biological and metabolic similarities to improve read-across. ALTEX. 37(3):469–481. doi: 10.14573/altex.2002281.
- Gajewicz A. 2017a. Development of valuable predictive read-across models based on “real-life” (sparse) nanotoxicity data. Environ Sci Nano. 4(6):1389–1403. doi: 10.1039/C7EN00102A.
- Gajewicz A. 2017b. What if the number of nanotoxicity data is too small for developing predictive nano-QSAR models? An alternative read-across based approach for filling data gaps. Nanoscale. 9(24):8435–8448. doi: 10.1039/C7NR02211E.
- Gallagher A, Kar S. 2024. Unveiling first report on in silico modeling of aquatic toxicity of organic chemicals to Labeo rohita (Rohu) employing QSAR and q-RASAR. Chemosphere. 349:140810. doi: 10.1016/j.chemosphere.2023.140810.
- Gellatly N, Sewell F. 2019. Regulatory acceptance of in silico approaches for the safety assessment of cosmetic-related substances. Comput Toxicol. 11:82–89. doi: 10.1016/j.comtox.2019.03.003.
- Ghosh S, Chatterjee M, Roy K. 2023a. Predictive quantitative read-across structure–property relationship modeling of the retention time (log tR) of pesticide residues present in foods and vegetables. J Agric Food Chem. 71(24):9538–9548. doi: 10.1021/acs.jafc.3c01438.
- Ghosh S, Chatterjee M, Roy K. 2023b. Quantitative read-across structure–activity relationship (q-RASAR): a new approach methodology to model aquatic toxicity of organic pesticides against different fish species. Aquat Toxicol. 265:106776. doi: 10.1016/j.aquatox.2023.106776.
- Ghosh V, Bhattacharjee A, Kumar A, Ojha PK. 2024. q-RASTR modelling for prediction of diverse toxic chemicals towards T. pyriformis. SAR QSAR Environ Res. 35(1):11–30. doi: 10.1080/1062936X.2023.2298452.
- Gini G, Franchi AM, Manganaro A, Golbamaki A, Benfenati E. 2014. ToxRead: a tool to assist in read across and its use to assess mutagenicity of chemicals. SAR QSAR Environ Res. 25(12):999–1011. doi: 10.1080/1062936X.2014.976267.
- Giordano D, Biancaniello C, Argenio MA, Facchiano A. 2022. Drug design by pharmacophore and virtual screening approach. Pharmaceuticals. 15(5):646. doi: 10.3390/ph15050646.
- Gocht T, Berggren E, Ahr HJ, Cotgreave I, Cronin MTD, Daston G, Hardy B, Heinzle E, Hescheler J, Knight DJ, et al. 2015. The SEURAT-1 approach towards animal free human safety assessment. ALTEX. 32(1):9–24. doi: 10.14573/altex.1408041.
- Golbamaki A, Franchi AM, Manganelli S, Manganaro A, Gini G, Benfenati E. 2017. ToxDelta: a new program to assess how dissimilarity affects the effect of chemical substances. Drug Des. 6(3):1000153. doi: 10.4172/2169-0138.1000153.
- Gómez-Bombarelli R, Wei JN, Duvenaud D, Hernández-Lobato JM, Sánchez-Lengeling B, Sheberla D, Aguilera-Iparraguirre J, Hirzel TD, Adams RP, Aspuru-Guzik A. 2018. Automatic chemical design using a data-driven continuous representation of molecules. ACS Cent Sci. 4(2):268–276. doi: 10.1021/acscentsci.7b00572.
- Grimme S. 2006. Semiempirical hybrid density functional with perturbative second-order correlation. J Chem Phys. 124(3):034108. doi: 10.1063/1.2148954.
- Groff L, Williams A, Shah I, Patlewicz G. 2024. MetSim: integrated programmatic access and simulators. Chem Res Toxicol. 37(5):685–697. doi: 10.1021/acs.chemrestox.3c00398.
- Gütlein M, Karwath A, Kramer S. 2012. CheS-Mapper – chemical space mapping and visualization in 3D. J Cheminform. 4(1):7. doi: 10.1186/1758-2946-4-7.
- Hall LH, Kier LB. 1995. Electrotopological state indices for atom types: a novel combination of electronic, topological, and valence state information. J Chem Inf Comput Sci. 35(6):1039–1045. doi: 10.1021/ci00028a014.
- Harrison PJ. 1968. A method of cluster analysis and some applications. Appl Stat. 17(3):226–236. doi: 10.2307/2985640.
- Haase A, Klaessig F. 2018. EU US Roadmap Nanoinformatics 2030. doi: 10.5281/zenodo.1486012.
- Helman G, Patlewicz G, Shah I. 2019. Quantitative prediction of repeat dose toxicity values using GenRA. Regul Toxicol Pharmacol. 109:104480. doi: 10.1016/j.yrtph.2019.104480.
- Helman G, Shah I, Patlewicz G. 2018. Extending the Generalised Read-Across approach (GenRA): a systematic analysis of the impact of physico-chemical property information on read-across performance. Comput Toxicol. 8:34–50. doi: 10.1016/j.comtox.2018.07.001.
- Helman G, Shah I, Patlewicz G. 2019. Transitioning the Generalised Read-Across approach (GenRA) to quantitative predictions: a case study using acute oral toxicity data. Comput Toxicol. 12:100097. doi: 10.1016/j.comtox.2019.100097.
- Helman G, Shah I, Williams AJ, Edwards J, Dunne J, Patlewicz G. 2019. Generalized read-across (GenRA): a workflow implemented into the EPA CompTox chemicals dashboard. ALTEX. 36(3):462–465. doi: 10.14573/altex.1811292.
- Hemmerich J, Ecker GF. 2020. In silico toxicology: from structure–activity relationships towards deep learning and adverse outcome pathways. Wiley Interdiscip Rev Comput Mol Sci. 10(4):e1475. doi: 10.1002/wcms.1475.
- Hewitt M, Ellison CM, Enoch SJ, Madden JC, Cronin MTD. 2010. Integrating (Q)SAR models, expert systems and read-across approaches for the prediction of developmental toxicity. Reprod Toxicol. 30(1):147–160. doi: 10.1016/j.reprotox.2009.12.003.
- Jiang J-R, Cai W-X, Chen Z-F, Liao X-L, Cai Z. 2024. Prediction of acute toxicity for Chlorella vulgaris caused by tire wear particle-derived compounds using quantitative structure–activity relationship models. Water Res. 256:121643. doi: 10.1016/j.watres.2024.121643.
- Johnson MA, Maggiora GM, editors. 1990. Concepts and applications of molecular similarity. New York: John Wiley.
- Kar S, Leszczynski J. 2020. Open access in silico tools to predict the ADMET profiling of drug candidates. Expert Opin Drug Discov. 15(12):1473–1487. doi: 10.1080/17460441.2020.1798926.
- Kar S, Leszczynski J. 2021. QSAR and machine learning modeling of toxicity of nanomaterials: a risk assessment approach. In: Njuguna J, Pielichowski K, Zhu H, editors. Health and environmental safety of nanomaterials. 2nd ed. Cambridge, UK: Woodhead Publishing. p. 417–441.
- Kar S, Sanderson H, Roy K, Benfenati E, Leszczynski J. 2020. Ecotoxicological assessment of pharmaceuticals and personal care products using predictive toxicology approaches. Green Chem. 22(5):1458–1516. doi: 10.1039/C9GC03265G.
- Kearnes S, Pande V. 2016. ROCS-derived features for virtual screening. J Comput Aided Mol Des. 30(8):609–617. doi: 10.1007/s10822-016-9959-3.
- Kekulé A. 1858. Ueber die Constitution und die Metamorphosen der chemischen Verbindungen und über die chemische Natur des Kohlenstoffs. Justus Liebigs Ann Chem. 106(2):129–159. doi: 10.1002/jlac.18581060202.
- Kim S, Thiessen PA, Cheng T, Zhang J, Gindulyte A, Bolton EE. 2019. PUG-View: programmatic access to chemical annotations integrated in PubChem. J Cheminform. 11(1):56. doi: 10.1186/s13321-019-0375-2.
- Kirsch P, Hartman AM, Hirsch AKH, Empting M. 2019. Concepts and core principles of fragment-based drug design. Molecules. 24(23):4309. doi: 10.3390/molecules24234309.
- Kostal J, Voutchkova-Kostal A. 2016. CADRE-SS, an in silico tool for predicting skin sensitization potential based on modeling of molecular interactions. Chem Res Toxicol. 29(1):58–64. doi: 10.1021/acs.chemrestox.5b00392.
- Kostal J, Voutchkova-Kostal A. 2023. Quantum-mechanical approach to predicting the carcinogenic potency of N-nitroso impurities in pharmaceuticals. Chem Res Toxicol. 36(2):291–304. doi: 10.1021/acs.chemrestox.2c00380.
- Kovarich S, Ceriani L, Gatnik MF, Bassan A, Pavan M. 2019. Filling data gaps by read-across: a mini review on its application, development and challenges. Mol Inform. 38:1800121. doi: 10.1002/minf.201800121.
- Kumar A, Kumar V, Ojha PK, Roy K. 2024. Chronic aquatic toxicity assessment of diverse chemicals on Daphnia magna using QSAR and chemical read-across. Regul Toxicol Pharmacol. 148:105572. doi: 10.1016/j.yrtph.2024.105572.
- Kumar A, Ojha PK, Roy K. 2024. First report on pesticide sub-chronic and chronic toxicities against dogs using QSAR and chemical read-across. SAR QSAR Environ Res. 35(3):241–263. doi: 10.1080/1062936X.2024.2320143.
- Kumar P, Kcat RL, Sigamani G. 2019. 7D QSAR based grid maps generated using quantum mechanic probes to identify hotspots and predict activity of mutated enzymes for enzyme engineering. In: Zhao H, Wong J, editors. Enzyme Engineering XXV. ECI Symposium Series. https://dc.engconfintl.org/enzyme_xxv/127.
- Kumar V, Banerjee A, Roy K. 2024. Breaking the barriers: machine learning-based c-RASAR approach for accurate blood–brain barrier permeability prediction. J Chem Inf Model. 64(10):4298–4309. doi: 10.1021/acs.jcim.4c00433.
- Labute P. 2000. A widely applicable set of descriptors. J Mol Graph Model. 18(4–5):464–477. doi: 10.1016/s1093-3263(00)00068-1.
- Lester C, Byrd E, Shobair M, Yan G. 2023. Quantifying analog suitability for SAR-based read-across toxicological assessment. Chem Res Toxicol. 36(2):230–242. doi: 10.1021/acs.chemrestox.2c00311.
- Li F, Wang P, Fan T, Zhang N, Zhao L, Zhong R, Sun G. 2024. Prioritization of the ecotoxicological hazard of PAHs towards aquatic species spanning three trophic levels using 2D-QSTR, read-across and machine learning-driven modelling approaches. J Hazard Mater. 465:133410. doi: 10.1016/j.jhazmat.2023.133410.
- Lizarraga LE, Dean JL, Kaiser JP, Wesselkamper SC, Lambert JC, Zhao QJ. 2019. A case study on the application of an expert-driven read-across approach in support of quantitative risk assessment of p,p′-dichlorodiphenyldichloroethane. Regul Toxicol Pharmacol. 103:301–313. doi: 10.1016/j.yrtph.2019.02.010.
- LoPachin RM, Gavin T, DeCaprio A, Barber DS. 2012. Application of the hard and soft, acids and bases (HSAB) theory to toxicant–target interactions. Chem Res Toxicol. 25(2):239–251. doi: 10.1021/tx2003257.
- López-Pérez K, López-López E, Medina-Franco JL, Miranda-Quintana RA. 2023. Sampling and mapping chemical space with extended similarity indices. Molecules. 28(17):6333. doi: 10.3390/molecules28176333.
- Low Y, Sedykh A, Fourches D, Golbraikh A, Whelan M, Rusyn I, Tropsha A. 2013. Integrative chemical-biological read-across approach for chemical hazard classification. Chem Res Toxicol. 26(8):1199–1208. doi: 10.1021/tx400110f.
- Lowe CN, Charest N, Ramsland C, Chang DT, Martin TM, Williams AJ. 2023. Transparency in modeling through careful application of OECD’s QSAR/QSPR principles via a curated water solubility data set. Chem Res Toxicol. 36(3):465–478. doi: 10.1021/acs.chemrestox.2c00379.
- Lowe CN, Isaacs KK, McEachran A, Grulke CM, Sobus JR, Ulrich EM, Richard A, Chao A, Wambaugh J, Williams AJ. 2021. Predicting compound amenability with liquid chromatography–mass spectrometry to improve non-targeted analysis. Anal Bioanal Chem. 413(30):7495–7508. doi: 10.1007/s00216-021-03713-w.
- Luechtefeld T, Marsh D, Rowlands C, Hartung T. 2018. Machine learning of toxicological big data enables read-across structure activity relationships (RASAR) outperforming animal test reproducibility. Toxicol Sci. 165(1):198–212. doi: 10.1093/toxsci/kfy152.
- Maggiora G, Vogt M, Stumpfe D, Bajorath J. 2014. Molecular similarity in medicinal chemistry. J Med Chem. 57(8):3186–3204. doi: 10.1021/jm401411z.
- Maldonado AG, Doucet JP, Petitjean M, Fan BT. 2006. Molecular similarity and diversity in chemoinformatics: from theory to applications. Mol Divers. 10(1):39–79. doi: 10.1007/s11030-006-8697-1.
- Mangiatordi GF, Alberga D, Altomare CD, Carotti A, Catto M, Cellamare S, Gadaleta D, Lattanzi G, Leonetti F, Pisani L, et al. 2016. Mind the gap! A journey towards computational toxicology. Mol Inform. 35(8–9):294–308. doi: 10.1002/minf.201501017.
- Mansouri K, Grulke CM, Judson RS, Williams AJ. 2018. OPERA models for predicting physico-chemical properties and environmental fate endpoints. J Cheminform. 10(1):10. doi: 10.1186/s13321-018-0263-1.
- Martin MT, Dix DJ. 2009. U.S. EPA’s Toxicity Reference Database: Martin and Dix respond. Environ Health Perspect. 117(10):A432–A433. doi: 10.1289/ehp.0900951R.
- McQuarrie DA. 2007. Quantum chemistry. 2nd ed. NY: University Science Books.
- Meanwell NA. 2014. The influence of bioisosteres in drug design: tactical applications to address developability problems. Tactics Contemp Drug Des. 9:283–381.
- Mikolajczyk A, Sizochenko N, Mulkiewicz E, Malankowska A, Rasulev B, Puzyn T. 2019. A chemoinformatics approach for the characterization of hybrid nanomaterials: safer and efficient design perspective. Nanoscale. 11(24):11808–11818. doi: 10.1039/C9NR01162E.
- Morlini I, Zani S. 2012. Dissimilarity and similarity measures for comparing dendrograms and their applications. Adv Data Anal Classif. 6(2):85–105. doi: 10.1007/s11634-012-0106-2.
- Muratov EN, Bajorath J, Sheridan RP, Tetko IV, Filimonov D, Poroikov V, Oprea TI, Baskin II, Varnek A, Roitberg A, et al. 2020. QSAR without borders. Chem Soc Rev. 49(11):3525–3564. doi: 10.1039/D0CS00098A.
- Nakamura T, Sakaue S, Fujii K, Harabuchi Y, Maeda S, Iwata S. 2022. Selecting molecules with diverse structures and properties by maximizing submodular functions of descriptors learned with graph neural networks. Sci Rep. 12(1):1124. doi: 10.1038/s41598-022-04967-9.
- Nath A, Ojha PK, Roy K. 2023. Computational modeling of aquatic toxicity of polychlorinated naphthalenes (PCNs) employing 2D-QSAR and chemical read-across. Aquat Toxicol. 257:106429. doi: 10.1016/j.aquatox.2023.106429.
- Niazi SK, Mariam Z. 2023. Recent advances in machine-learning-based chemoinformatics: a comprehensive review. Int J Mol Sci. 24(14):11488. doi: 10.3390/ijms241411488.
- Nikolova N, Jaworska J. 2003. Approaches to measure chemical similarity – a review. QSAR Comb Sci. 22(9–10):1006–1026. doi: 10.1002/qsar.200330831.
- O’Boyle NM, Sayle RA. 2016. Comparing structural fingerprints using a literature-based similarity benchmark. J Cheminform. 8(1):36. doi: 10.1186/s13321-016-0148-0.
- OECD. 2014. Guidance document on the validation of (quantitative) structure–activity relationship [(Q)SAR] models. Paris: OECD Publishing.
- OECD. 2017. Guidance on grouping of chemicals. 2nd ed. Series on testing & assessment no. 194. https://www.oecd.org/publications/guidance-on-grouping-of-chemicals-second-edition-9789264274679-en.htm.
- OECD. 2023. (Q)SAR assessment framework: guidance for the regulatory assessment of (quantitative) structure–activity relationship models, predictions, and results based on multiple predictions. OECD Series on Testing and Assessment, No. 386, Environment, Health and Safety, Environment Directorate. OECD. https://www.oecd.org/chemicalsafety/risk-assessment/qsar-assessment-framework.pdf.
- Pandey SK, Roy K. 2023. Development of a read-across-derived classification model for the predictions of mutagenicity data and its comparison with traditional QSAR models and expert systems. Toxicology. 500:153676. doi: 10.1016/j.tox.2023.153676.
- Papadiamantis AG, Afantitis A, Tsoumanis A, Valsami-Jones E, Lynch I, Melagraki G. 2021. Computational enrichment of physico-chemical data for the development of a ζ-potential read-across predictive model with Isalos Analytics Platform. NanoImpact. 22:100308. doi: 10.1016/j.impact.2021.100308.
- Patlewicz G, Ball N, Boogaard PJ, Becker RA, Hubesch B. 2015. Building scientific confidence in the development and evaluation of read-across. Regul Toxicol Pharmacol. 72(1):117–133. doi: 10.1016/j.yrtph.2015.03.015.
- Patlewicz G, Cronin MTD, Helman G, Lambert JC, Lizarraga LE, Shah I. 2018. Navigating through the minefield of read-across frameworks: a commentary perspective. Comput Toxicol. 6:39–54. doi: 10.1016/j.comtox.2018.04.002.
- Patlewicz G, Karamertzanis P, Friedman KP, Sannicola M, Shah I. 2024. A systematic analysis of read-across within REACH registration dossiers. Comput Toxicol. 30:1–15. doi: 10.1016/j.comtox.2024.100304.
- Patlewicz G, Shah I. 2023. Towards systematic read-across using Generalised Read-Across (GenRA). Comput Toxicol. 25:1–15. doi: 10.1016/j.comtox.2022.100258.
- Paul R, Chatterjee M, Roy K. 2022. First report on soil ecotoxicity prediction against Folsomia candida using intelligent consensus predictions and chemical read-across. Environ Sci Pollut Res Int. 29(58):88302–88317. doi: 10.1007/s11356-022-21937-w.
- Pestana CB, Firman JW, Cronin MTD. 2021. Incorporating lines of evidence from new approach methodologies (NAMs) to reduce uncertainties in a category based read-across: a case study for repeated dose toxicity. Regul Toxicol Pharmacol. 120:104855. doi: 10.1016/j.yrtph.2020.104855.
- Probst D, Reymond J-L. 2018. A probabilistic molecular fingerprint for big data settings. J Cheminform. 10(1):66. doi: 10.1186/s13321-018-0321-8.
- Radchenko EV, Makhaeva GF, Palyulin VA, Zefirov NS. 2017. Chemical similarity, shape matching and QSAR. In: Richardson RJ, Johnson DE, editors. Issues in toxicology (2017 Ebook collection). London, UK: Royal Society of Chemistry.
- Randic M. 1984. On molecular identification numbers. J Chem Inf Comput Sci. 24(3):164–175. doi: 10.1021/ci00043a009.
- Riniker S, Landrum GA. 2013. Similarity maps – a visualization strategy for molecular fingerprints and machine-learning methods. J Cheminform. 5(1):43. doi: 10.1186/1758-2946-5-43.
- Rogers D, Hahn M. 2010. Extended-connectivity fingerprints. J Chem Inf Model. 50(5):742–754. doi: 10.1021/ci100050t.
- Rouvray DH. 1990. The evolution of the concept of molecular similarity. In: Johnson MA, Maggiora GM, editors. Concepts and applications of molecular similarity. New York: John Wiley; p. 15–42.
- Rouvray DH. 1992. Definition and role of similarity concepts in the chemical and physical sciences. J Chem Inf Comput Sci. 32(6):580–586. doi: 10.1021/ci00010a002.
- Roy A, Basak S, Harriss D, Magnuson V. 1984. Neighborhood complexities and symmetry of chemical graphs and their biological applications. In: Avula XJR, Kalman RE, Liapis AI, Rodin EY, editors. Mathematical modelling in science and technology. New York (NY): Elsevier; p. 745–750.
- Roy K, Banerjee A. 2024. q-RASAR. A path to predictive cheminformatics. NY: Springer.
- Roy K, Kar S, Das RN. 2015a. Understanding the basics of QSAR for applications in pharmaceutical sciences and risk assessment. Elsevier; p. 47–80.
- Roy K, Kar S, Das RN. 2015b. A primer on QSAR/QSPR modeling. New York (NY): Springer.
- Russo DP, Strickland J, Karmaus AL, Wang W, Shende S, Hartung T, Aleksunes LM, Zhu H. 2019. Nonanimal models for acute toxicity evaluations: applying data-driven profiling and read-across. Environ Health Perspect. 127(4):47001. doi: 10.1289/EHP3614.
- Santana R, Zuluaga R, Gañán P, Arrasate S, Onieva E, González-Díaz H. 2020. Predicting coated-nanoparticle drug release systems with perturbation-theory machine learning (PTML) models. Nanoscale. 12(25):13471–13483. doi: 10.1039/D0NR01849J.
- Schultz T, Amcoff P, Berggren E, Gautier F, Klaric M, Knight D, Mahony C, Schwarz M, White A, Cronin M. 2015. A strategy for structuring and reporting a read-across prediction of toxicity. Regul Toxicol Pharmacol. 72(3):586–601. doi: 10.1016/j.yrtph.2015.05.016.
- Schultz TW, Cronin MTD. 2017. Lessons learned from read-across case studies for repeated-dose toxicity. Regul Toxicol Pharmacol. 88:185–191. doi: 10.1016/j.yrtph.2017.06.011.
- Schultz TW, Richarz AN, Cronin MTD. 2019. Assessing uncertainty in read-across: questions to evaluate toxicity predictions based on knowledge gained from case studies. Comput Toxicol. 9:1–11. doi: 10.1016/j.comtox.2018.10.003.
- Shah I, Liu J, Judson RS, Thomas RS, Patlewicz G. 2016. Systematically evaluating read-across prediction and performance using a local validity approach characterized by chemical structure and bioactivity information. Regul Toxicol Pharmacol. 79:12–24. doi: 10.1016/j.yrtph.2016.05.008.
- Shah I, Tate T, Patlewicz G. 2021. Generalized read-across prediction using genra-py. Bioinformatics. 37(19):3380–3381. doi: 10.1093/bioinformatics/btab210.
- Sheffield TY, Judson RS. 2019. Ensemble QSAR modeling to predict multispecies fish toxicity lethal concentrations and points of departure. Environ Sci Technol. 53(21):12793–12802. doi: 10.1021/acs.est.9b03957.
- Skare JA, Blackburn K, Wu S, Re TA, Duche D, Ringeissen S, Bjerke DL, Srinivasan V, Eisenmann C. 2015. Use of read-across and computer-based predictive analysis for the safety assessment of PEG cocamines. Regul Toxicol Pharmacol. 71(3):515–528. doi: 10.1016/j.yrtph.2015.01.013.
- Sobańska AW. 2023. In silico assessment of risks associated with pesticides exposure during pregnancy. Chemosphere. 329:138649. doi: 10.1016/j.chemosphere.2023.138649.
- Sun G, Bai P, Fan T, Zhao L, Zhong R, McElhinney RS, McMurry TBH, Donnelly DJ, McCormick JE, Kelly J, et al. 2023. QSAR and chemical read-across analysis of 370 potential MGMT inactivators to identify the structural features influencing inactivation potency. Pharmaceutics. 15(8):2170. doi: 10.3390/pharmaceutics15082170.
- Tate T, Wambaugh J, Patlewicz G, Shah I. 2021. Repeat-dose toxicity prediction with generalized read-across (GenRA) using targeted transcriptomic data: a proof-of-concept case study. Comput Toxicol. 19:1–12. doi: 10.1016/j.comtox.2021.100171.
- Todeschini R, Consonni V. 2009. Molecular descriptors for chemoinformatics: volume I: alphabetical listing/volume II: appendices, references. NJ: John Wiley & Sons.
- Tro MJ, Charest N, Taitz Z, Shea J-E, Bowers MT. 2019. The classifying autoencoder: gaining insight into amyloid assembly of peptides and proteins. J Phys Chem B. 123(25):5256–5264. doi: 10.1021/acs.jpcb.9b03415.
- Varsou DD, Sarimveis H. 2021. Apellis: an online tool for read-across model development. Comput Toxicol. 17:100146. doi: 10.1016/j.comtox.2020.100146.
- Varsou DD, Sarimveis H. 2023. Deimos: a novel automated methodology for optimal grouping. Application to nanoinformatics case studies. Mol Inform. 42(8–9):e202300019. doi: 10.1002/minf.202300019.
- Viganò EL, Colombo E, Raitano G, Manganaro A, Sommovigo A, Dorne JLC, Benfenati E. 2022. Virtual extensive read-across: a new open-access software for chemical read-across and its application to the carcinogenicity assessment of botanicals. Molecules. 27(19):6605. doi: 10.3390/molecules27196605.
- Weinhold F, Landis CR. 2005. Valency and bonding: a natural bond orbital donor–acceptor perspective. Cambridge (UK): Cambridge University Press.
- Wenzel J, Schmidt F, Blumrich M, Amberg A, Czich A. 2022. Predicting DNA-reactivity of N-nitrosamines: a quantum chemical approach. Chem Res Toxicol. 35(11):2068–2084. doi: 10.1021/acs.chemrestox.2c00217.
- Westmoreland C, Bender HJ, Doe JE, Jacobs MN, Kass GEN, Madia F, Mahony C, Manou I, Maxwell G, Prieto P, et al. 2022. Use of new approach methodologies (NAMs) in regulatory decisions for chemical safety: report from an EPAA Deep Dive Workshop. Regul Toxicol Pharmacol. 135:105261. doi: 10.1016/j.yrtph.2022.105261.
- Wildman SA, Crippen GM. 1999. Prediction of physicochemical parameters by atomic contributions. J Chem Inf Comput Sci. 39(5):868–873. doi: 10.1021/ci990307l.
- Willett P. 1997. Computational methods for the analysis of molecular diversity. Leiden: ESCOM.
- Willett P. 2014. The calculation of molecular structural similarity: principles and practice. Mol Inform. 33(6–7):403–413. doi: 10.1002/minf.201400024.
- Willett P. 2016. Chapter 6: molecular similarity approaches in chemoinformatics: early history and literature status. In: Bajorath J, editor. Frontiers in molecular design and chemical information science – Herman Skolnik Award Symposium 2015. Washington (DC): American Chemical Society; p. 67–89.
- Williams AJ, Grulke CM, Edwards J, McEachran AD, Mansouri K, Baker NC, Patlewicz G, Shah I, Wambaugh JF, Judson RS, et al. 2017. The CompTox chemistry dashboard: a community data resource for environmental chemistry. J Cheminform. 9(1):61. doi: 10.1186/s13321-017-0247-6.
- Yang C, Tarkhov A, Marusczyk J, Bienfait B, Gasteiger J, Kleinoeder T, Magdziarz T, Sacher O, Schwab CH, Schwoebel J, et al. 2015. New publicly available chemical query language, CSRML, to support chemotype representations for application to data mining and modeling. J Chem Inf Model. 55(3):510–528. doi: 10.1021/ci500667v.
- Yang L, Tian R, Li Z, Ma X, Wang H, Sun W. 2023. Data driven toxicity assessment of organic chemicals against Gammarus species using QSAR approach. Chemosphere. 328:138433. doi: 10.1016/j.chemosphere.2023.138433.
- Yang S, Kar S. 2023. Application of artificial intelligence and machine learning in early detection of adverse drug reactions (ADRs) and drug-induced toxicity. Artif Intell Chem. 1(2):100011. doi: 10.1016/j.aichem.2023.100011.
- Zhang XM, Liang L, Liu L, Tang MJ. 2021. Graph neural networks and their current applications in bioinformatics. Front Genet. 12:690049. doi: 10.3389/fgene.2021.690049.
- Zhu H, Bouhifd M, Donley E, Egnash L, Kleinstreuer N, Kroese ED, Liu Z, Luechtefeld T, Palmer J, Pamies D, et al. 2016. Supporting read-across using biological data. ALTEX. 33(2):167–182. doi: 10.14573/altex.1601252.
