Environmental Health Perspectives
. 2024 Aug 6;132(8):085002. doi: 10.1289/EHP14001

Unlocking the Potential of Clustering and Classification Approaches: Navigating Supervised and Unsupervised Chemical Similarity

Kamel Mansouri 1, Kyla Taylor 1, Scott Auerbach 1, Stephen Ferguson 1, Rachel Frawley 1, Jui-Hua Hsieh 1, Gloria Jahnke 1, Nicole Kleinstreuer 1, Suril Mehta 1, José T Moreira-Filho 1, Fred Parham 1, Cynthia Rider 1, Andrew A Rooney 1, Amy Wang 1, Vicki Sutherland 1
PMCID: PMC11302584  PMID: 39106156

Abstract

Background:

The field of toxicology has witnessed substantial advancements in recent years, particularly with the adoption of new approach methodologies (NAMs) to understand and predict chemical toxicity. Class-based methods such as clustering and classification are key to NAMs development and application, aiding the understanding of hazard and risk concerns associated with groups of chemicals without additional laboratory work. Advances in computational chemistry, data generation and availability, and machine learning algorithms represent important opportunities for continued improvement of these techniques to optimize their utility for specific regulatory and research purposes. However, because of their intricacy, a deep understanding and careful selection of these methods are imperative to align them with their intended applications.

Objectives:

This commentary aims to deepen the understanding of class-based approaches by elucidating the pivotal role of chemical similarity (structural and biological) in clustering and classification approaches (CCAs). It addresses the dichotomy between general end point–agnostic similarity, often entailing unsupervised analysis, and end point–specific similarity necessitating supervised learning. The goal is to highlight the nuances of these approaches, their applications, and common misuses.

Discussion:

Understanding similarity is pivotal in toxicological research involving CCAs. The effectiveness of these approaches depends on the right definition and measure of similarity, which vary based on the context and objectives of the study. This choice is influenced by how chemical structures are represented and the respective labels indicating biological activity, if applicable. The distinction between unsupervised clustering and supervised classification methods is vital, requiring the use of end point–agnostic vs. end point–specific similarity definitions, respectively. Separate use or combination of these methods requires careful consideration to prevent bias and ensure relevance for the goal of the study. Unsupervised methods use end point–agnostic similarity measures to uncover general structural patterns and relationships, aiding hypothesis generation and facilitating exploration of datasets without the need for predefined labels or explicit guidance. Conversely, supervised techniques demand end point–specific similarity to group chemicals into predefined classes or to train classification models, allowing accurate predictions for new chemicals. Misuse can arise when unsupervised methods are applied to end point–specific contexts, like analog selection in read-across, leading to erroneous conclusions. This commentary provides insights into the significance of similarity and its role in supervised classification and unsupervised clustering approaches. https://doi.org/10.1289/EHP14001

Introduction

Chemical hazard and risk assessments aim to estimate toxicological outcomes and evaluate the potential risks that chemicals may pose to human health and the environment. However, studying chemicals one by one is expensive and time-consuming and can contribute to harmful chemicals being replaced by new chemicals that are structurally similar (i.e., chemicals that perform the same function in manufacturing or in the final product as the harmful chemicals did) but lack toxicological information.

Therefore, there has been an increasing shift in focus from individual chemical analyses to assessing groups of chemicals or agents. Class-based assessment approaches often employ high-throughput methods to forecast toxicity by drawing on the similarities of previously characterized compounds. Historical class-based efforts have typically focused on categorizing chemicals solely based on their observed or overall structural similarities.1,2 This approach often overlooks nuanced aspects of chemical similarity that, depending on how structures are represented and described, can have substantial implications for relating structure to biological activity.3,4 While chemical structure is the primary criterion for grouping substances together, other factors, such as biological activity, internal exposure, and/or pathways, should also be considered for classification studies.2,5,6 Thanks to advances in information technology, including machine learning, various computational techniques can now be deployed to develop more efficient methods for toxicology and hazard assessment by incorporating structural and functional information as well as phenotypic outcomes and mechanistic data. These computational advancements update the standard class-based approach by looking beyond general chemical structure properties and utilizing multiple streams of information on select chemicals, allowing for more specific grouping of the data to better identify potential concerns.7–9 Nevertheless, there remains a pressing need to understand the circumstances and techniques under which such approaches can be efficiently and effectively used to evaluate the hazards of chemical exposure to human health.

The aim of this commentary was to provide an overview of clustering and classification approaches (CCAs). The key concepts for defining similarity, considerations around supervised and unsupervised approaches, and current applications in regulatory contexts are discussed. This work also highlights challenges and potential solutions identified, which inspired the development of a workflow applying CCAs,10,11 and proposes future collaborations to further develop best practices and increase the use of CCAs in toxicity research and risk assessment. These topics were discussed in depth at a workshop held by the Division of Translational Toxicology (DTT), National Institute of Environmental Health Sciences (NIEHS). Detailed information about the organization of the event and the recordings of the different sessions, which informed the current commentary, are available at https://www.niehs.nih.gov/news/events/pastmtg/2022/nams2022/index.cfm.

Insights and Practical Applications

Background

The field of toxicology has evolved significantly in recent years, with increasing emphasis on the use of new approach methodologies (NAMs) to understand chemical toxicity.12,13 CCAs are important tools for analyzing chemical data and play a key role in the development and application of NAMs.1,9 However, these methods can be complex and require careful consideration when selecting the appropriate approach for a given application. A wide array of CCAs have been used to interpret toxicological implications of chemical similarity and have been discussed in various guidance documents9,14–18 and scholarly publications.1,19,20

However, extending these tools to prioritizing chemicals for testing, making risk-based decisions for data-poor chemicals, predicting toxicological effects, and integrating information to understand chemical effects in biological systems has always been a challenge. This is due, in particular, to conflation between general chemical features denoting end point–agnostic similarity and end point–specific, biological effect–based similarity.3,21

Similarity is a fundamental concept in cheminformatics and is essential for the accurate application of CCAs in chemical grouping as well as other in silico NAMs [i.e., quantitative structure–activity relationships (QSAR) and read-across]. The concept of similarity is contextual and varies based on the analysis’s objectives and the data type being evaluated. Therefore, there is no singular type of similarity or universal definition for it.22,23 Varnek showed an example of nine different geometrical elements that can be grouped based on various types of similarity: shape, color, size, and pattern. Each element can be deemed similar to every other one in one of the four aspects.21 Thus, the selection of similarity here is arbitrary and can only fulfill one definition since opting for one aspect of similarity automatically disregards the other types.

The same concept of similarity applies to chemical structures, which can all exhibit similarities in various ways. Consequently, a list of chemicals can be categorized differently depending on the attributes and features considered crucial for defining the desired similarity. It is essential to distinguish between general end point–agnostic similarity, typically observed visually or arbitrarily and deduced through unsupervised methods, and end point–specific similarity, requiring supervised analysis.21 This stems from the fact that general similarity is not tied to a particular biological activity, whereas end point–specific similarity is determined by the structural characteristics governing a specific biological activity or toxicity. Hence, the selection of a similarity approach relies on the study’s context and goals, which, in turn, are influenced by how chemical structures are represented and encoded in the explanatory variables. Varnek showed a list of 16 chemicals that, while structurally diverse, all share a formyl functional group consisting of a carbonyl bonded to hydrogen.21 Thus, all of these chemicals can be grouped as aldehydes, representing one of the similarities that they share. Alternatively, these chemicals can be grouped based on other similarity aspects, as depicted in the same figure by Varnek.21 The chemicals in this example were initially grouped based on the shared common scaffold, representing the most visually similar grouping.21,24–26 However, the same chemicals can also be grouped based on common functional groups.21 Although the resulting grouped chemicals might not visually resemble each other, they share significant functional similarities that could be more relevant for certain biological activities compared to the first grouping.21 Another example in which similarity is not visually apparent is that of constitutional isomers, which share the same molecular formula (and weight) but look different because of their different atom connectivity (Figure 1).27

Figure 1.

Figure 1 is a set of four chemicals with the same molecular formula, C6H12. The first chemical, cyclohexane, has a six-carbon ring. The second chemical, 1-hexene, has an unbranched structure. The third chemical, 3-methyl-1-pentene, has one branch. The fourth chemical, 1-ethyl-2-methyl cyclopropane, has a three-carbon ring.

Constitutional isomers sharing the same molecular formula but differing in their atom connectivity.
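
To make this point concrete, the short Python sketch below verifies it computationally. It assumes the open-source RDKit package; the SMILES strings are written here from the chemical names in Figure 1 and are illustrative, not taken from the original article.

```python
# Minimal sketch (assumes RDKit): the four C6H12 isomers in Figure 1 share a
# molecular formula but differ in atom connectivity, which shows up in their
# canonical SMILES and in fingerprint-based similarity.
from rdkit import Chem
from rdkit.Chem import AllChem, rdMolDescriptors
from rdkit import DataStructs

isomers = {
    "cyclohexane": "C1CCCCC1",
    "1-hexene": "C=CCCCC",
    "3-methyl-1-pentene": "C=CC(C)CC",
    "1-ethyl-2-methylcyclopropane": "CCC1CC1C",
}
mols = {name: Chem.MolFromSmiles(smi) for name, smi in isomers.items()}

for name, mol in mols.items():
    formula = rdMolDescriptors.CalcMolFormula(mol)  # identical for all four
    print(f"{name:30s} {formula:6s} {Chem.MolToSmiles(mol)}")

# Pairwise Morgan-fingerprint Tanimoto similarities fall well below 1.0,
# showing that a shared formula does not imply structural similarity.
fps = {n: AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=1024)
       for n, m in mols.items()}
names = list(fps)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        print(f"Tanimoto({a}, {b}) = "
              f"{DataStructs.TanimotoSimilarity(fps[a], fps[b]):.2f}")
```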

To enable mathematical or statistical computation of similarity levels, chemical characteristics, often referred to as explanatory variables, features, or attributes, must be numerically represented. These variables can be generated directly from structural features and representations, experimental and predicted outcomes (in silico, in vitro, in vivo, physicochemical properties, genomics…), or a combination thereof. Explanatory variables generated based on structure are commonly known as molecular descriptors and fingerprints.

A molecular descriptor is a number or result derived from a logical and mathematical process that turns the symbolic representation of a molecule into useful information for analysis.28 The molecular descriptors encode various types of information, ranging from 0D (e.g., atom counts), 1D (e.g., substructure representation), 2D (e.g., topological representation), 3D (e.g., spatial configuration), to 4D (e.g., stereo-electronic representation).28–30 This range of information levels, as depicted in Figure 2, offers significant flexibility in analyzing chemical data and defining similarity for various grouping purposes, including end point–specific applications.28,31,32

Figure 2.

Figure 2 illustrates the following information: A chemical structure can be represented in 5 dimensions, each with additional/more complex information. A chemical structure with zero dimensions reflects a molecular formula, which includes molecular weight, atom count, and atomic characteristics. A chemical structure in one dimension reflects substructure representation, which includes substructural searches and fingerprinting. A two-dimensional chemical structure represents topological representation, such as a molecular graph and atomic connection. A three-dimensional chemical structure represents geometric representation, which includes spatial configuration as well as shape and size characteristics. A four-dimensional chemical structure represents stereo-electronic representation, which includes interactions, grid-based quantitative structure-activity relationships, conformations, and dynamic quantitative structure-activity correlations. The content richness increases from zero to four dimensions. The computational simplicity progresses from four to zero dimensions.

Five levels of information encoded in different types of molecular descriptors. Note: QSAR, quantitative structure–activity relationships.
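
As a concrete illustration of these information levels, the sketch below computes a few representative descriptors with RDKit for an arbitrarily chosen example molecule (caffeine). The choice of molecule and of descriptors is ours for illustration and is not prescribed by the article.

```python
# Sketch (assumes RDKit): example descriptors at increasing information
# levels, loosely mirroring the 0D-3D levels in Figure 2.
from rdkit import Chem
from rdkit.Chem import AllChem, Descriptors, rdMolDescriptors

mol = Chem.MolFromSmiles("CN1C=NC2=C1C(=O)N(C)C(=O)N2C")  # caffeine (example)

# 0D: quantities derived from composition alone
print("Molecular weight:", round(Descriptors.MolWt(mol), 2))
print("Heavy atom count:", mol.GetNumHeavyAtoms())

# 1D: presence/absence of a substructure (here, a carbonyl SMARTS query)
print("Contains carbonyl:", mol.HasSubstructMatch(Chem.MolFromSmarts("[CX3]=O")))

# 2D: topological descriptors computed from the molecular graph
print("TPSA:", round(Descriptors.TPSA(mol), 2))
print("Ring count:", rdMolDescriptors.CalcNumRings(mol))

# 3D: geometric descriptors require an embedded conformer
mol3d = Chem.AddHs(mol)
AllChem.EmbedMolecule(mol3d, randomSeed=42)  # generate one 3D conformation
print("Radius of gyration:", round(rdMolDescriptors.CalcRadiusOfGyration(mol3d), 2))
```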

A fingerprint, conversely, is a string of binary bits (also known as 1D descriptors) that represent the existence or absence of predefined fragments, scaffolds, substructures, functional groups, atom pairs, or a combination thereof (Figure 3).28,33 Examples of commonly used fingerprints include MACCS (Molecular ACCess System), PubChem, and Morgan.33–35 Fingerprints are of fixed length and provide a general description of chemicals that is most often used for exploratory analysis or database searching.35–37 Thus, the type of explanatory variables should be selected based on the type of similarity considered for the grouping.

Figure 3.

Figure 3 is an illustration depicting a molecular fingerprint as a binary string of 0s and 1s. The position of each digit in the string represents a specific structural fragment, such as a nitrogen atom or a hydroxyl group; 1 represents the presence of the specified fragment, and 0 represents its absence.

Illustration of a fingerprint as a binary string indicating the presence or absence of specific substructures.
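
A minimal sketch of fingerprint generation, again assuming RDKit, is shown below. The two molecules (aspirin and salicylic acid) are arbitrary examples used only to show the fixed-length binary output of MACCS and Morgan fingerprints.

```python
# Sketch (assumes RDKit): fixed-length binary fingerprints for two example
# molecules; each bit flags the presence (1) or absence (0) of a feature.
from rdkit import Chem
from rdkit.Chem import AllChem, MACCSkeys

aspirin = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")
salicylic_acid = Chem.MolFromSmiles("O=C(O)c1ccccc1O")

# MACCS keys: a 167-bit structural-key fingerprint
maccs = MACCSkeys.GenMACCSKeys(aspirin)
print("MACCS length:", maccs.GetNumBits())
print("Bits set:", maccs.GetNumOnBits())
print("First 40 bits:", maccs.ToBitString()[:40])

# Morgan (circular) fingerprint hashed to a fixed 2048-bit vector
morgan = AllChem.GetMorganFingerprintAsBitVect(salicylic_acid, radius=2, nBits=2048)
print("Morgan length:", morgan.GetNumBits())
print("Bits set:", morgan.GetNumOnBits())
```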

Similarity assessment can involve utilizing general unfiltered chemical and/or biological response data (end point–agnostic similarity) or a curated subset specifically related to a particular outcome of interest (end point–specific similarity). These scenarios correspond to unsupervised and supervised approaches, respectively.

Unsupervised learning methods, such as clustering, rely on general similarity measures (e.g., chemical features, biological response data from high-throughput assays) to identify patterns and relationships among chemicals without any prior knowledge of the end point of interest (unlabeled data).38,39 These methods are useful in identifying groups of similar chemicals (i.e., chemicals with the most similar features).40 Unsupervised methods are useful for hypothesis generation. However, unsupervised methods may not be suitable for end point–specific grouping of chemicals (e.g., similar structures based on an arbitrary list of features may not produce similar adverse health effects), and their results need to be interpreted in that regard.38,39,41
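
The sketch below illustrates such an end point–agnostic analysis with k-means clustering in scikit-learn. The descriptor matrix is a random placeholder standing in for real molecular descriptors; no end point labels enter the analysis.

```python
# Sketch (assumes scikit-learn): unsupervised, end point-agnostic clustering
# of a placeholder descriptor matrix. No labels are used.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))  # placeholder: 100 chemicals x 8 descriptors

X_scaled = StandardScaler().fit_transform(X)  # put descriptors on one scale
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_scaled)

print("Cluster sizes:", np.bincount(kmeans.labels_))
# The clusters group chemicals by overall descriptor similarity only; they
# carry no guarantee of sharing any specific toxicological end point.
```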

In contrast, supervised learning methods, such as classification, require end point–specific similarity measures to analyze labeled data for studies of specific biological or functional end points.5,42 These methods use training data labels that contain information about the adverse findings or end points of interest to learn the relationship between the chemicals’ features and the biological activity. Supervised learning methods are suitable for developing end point–specific hypotheses and building predictive models to assess new chemicals.43,44
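
By contrast with the clustering sketch above, the sketch below trains a supervised classifier on labeled data. The descriptors and the binary end point are simulated placeholders, used only to show the workflow of cross-validated training and prediction for a new chemical.

```python
# Sketch (assumes scikit-learn): supervised, end point-specific classification
# on placeholder data; labels y stand in for an activity/toxicity end point.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 8))                  # 200 chemicals x 8 descriptors
y = (X[:, 0] + 0.5 * X[:, 3] > 0).astype(int)  # toy end point driven by 2 features

clf = RandomForestClassifier(n_estimators=200, random_state=0)
scores = cross_val_score(clf, X, y, cv=5, scoring="balanced_accuracy")
print("Cross-validated balanced accuracy:", round(scores.mean(), 2))

clf.fit(X, y)
new_chemical = rng.normal(size=(1, 8))  # descriptor vector for a new chemical
print("Predicted class:", clf.predict(new_chemical)[0])
```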

While both unsupervised and supervised techniques provide a wealth of information, we argue that it is important to exercise caution to ensure each approach is used for its intended purpose. For example, a common misuse involves the application of unsupervised learning methods as part of a process where the final goal is end point specific.45 An example of this is the unsupervised selection of an initial set of analogues in read-across that targets an end point–specific prediction. For this approach, the analogues selected depend on the first step (i.e., identifying patterns and relationships of the analogues of interest). Essentially, applying an unsupervised method (such as an arbitrarily selected fingerprint with a Tanimoto score) that relies on end point–agnostic similarities can only lead to analogues sharing general features not specific to the target end point. This poses a risk that the same selected analogues will be reused for different end points, even though the subsequent analysis steps might depend on the availability of end point–specific data. Visual similarity, or end point–agnostic chemical structural similarity based on general, arbitrarily selected descriptors or fingerprints (a commonly noticed issue during the selection of initial analogues for read-across), does not necessarily lead to end point–specific similarity, and using the same chemicals for different end point studies can result in erroneous interpretations.39 Similarly, when a supervised approach is used, any attempt to derive a general interpretation will be biased by the initial end point–specific dataset on which the algorithm was trained.38

The distinction between using supervised and unsupervised similarity is crucial for developing end point–specific models and making predictions for new chemicals.39,41,42 Unsupervised methods use an arbitrary choice of the initial explanatory variables and lead to general similarity, while supervised methods employ machine learning and/or expert judgment to select the most influential variables for a specific end point, leading to a more targeted aspect of similarity. The combination of supervised and unsupervised methods can be relevant at times.38 However, it requires careful consideration to avoid bias and ensure that the similarity aspect in selecting analogues, for example, is appropriate for the end point of interest.46

Chemical Similarity: General and End Point Specific

Understanding similarity is essential for utilizing CCAs; similarity encompasses everything from chemical structure, biological activity, and phenotypic outcomes to the inclusion of mechanistic data and is dependent on the context and goal of research or study. Thus, the importance of distinguishing between end point–specific (supervised) and general/end point–agnostic (unsupervised) similarity cannot be overstated for choosing the appropriate grouping method.38,43

In class-based assessments, we assume that “similar” chemicals belong to one class, where similarity can be defined using different information. Chemical similarity can be based on structural attributes (e.g., bonds, number of aromatic rings, common functional group), properties (e.g., physicochemical properties, chemical reactivity), biological and toxicological effects (e.g., biological response, specific toxicity), toxicokinetics (e.g., skin absorption level, same precursor or metabolite), mechanistic information [e.g., how a toxicity is induced, mode of action, adverse outcome pathway (AOP), or mechanism of toxicity], or a combination thereof. The word similarity applies to having a consistent trend (e.g., for molecules with a given functional group, boiling point increases with molecular weight) or following a pattern.47,48 However, incorporating more parameters and descriptors does not always lead to better grouping of chemicals.49–51 Consonni et al.52 and Varmuza et al.,53 among others, have shown that exceeding an optimal number of variables in a QSAR model invariably results in overfitting and diminished predictive capability.

We posit that exploration of chemical similarity necessitates the characterization of the molecular structures of chemicals. Molecular descriptors and fingerprints provide avenues for representing molecular structures and their features.35 Through the establishment of a shared representation framework, these tools facilitate the encoding of both detailed and overarching characteristics inherent to molecular architectures.28 This process involves the conversion of the molecular structure into numerical representations, which can be used in machine learning approaches for tasks requiring similarity analysis.28 When selecting appropriate molecular descriptors and fingerprints from the array of available types, one should consider the context and objectives of the study, emphasizing the identification of the most pertinent facets of the structural properties (e.g., in the selection of analogues for read-across, arbitrary choices of different fingerprints lead to different analogues).3,54

Calculating similarity between chemicals is vital for both supervised and unsupervised approaches.2,55 One way to do this is by utilizing continuous molecular descriptors, fingerprints, and toxicological alerts.23,28,35 Similarity between chemicals is calculated by comparing these numerical representations across chemical pairs. To ensure meaningful results, it is crucial to select descriptors relevant to the specific end point under consideration. When dealing with continuous descriptors, distance metrics like Euclidean or cosine distance can be employed.56–59 Binary descriptors, like fingerprints or structural alerts, require specialized measures such as the Jaccard–Tanimoto similarity coefficient.33,36,60–64 Comparisons of available similarity coefficients can inform selection of appropriate tools.65
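
The sketch below, using SciPy, shows how the choice of measure follows from the variable type; the descriptor values and fingerprint bits are illustrative placeholders.

```python
# Sketch (assumes SciPy/NumPy): matching the similarity measure to the
# variable type. Continuous descriptors -> Euclidean or cosine distance;
# binary fingerprints -> Jaccard-Tanimoto similarity.
import numpy as np
from scipy.spatial.distance import cosine, euclidean, jaccard

# Continuous descriptors (placeholder values; scaling is advisable in practice)
desc_a = np.array([2.1, 180.2, 63.6])
desc_b = np.array([1.2, 138.1, 57.5])
print("Euclidean distance:", round(euclidean(desc_a, desc_b), 2))
print("Cosine distance:", round(cosine(desc_a, desc_b), 4))

# Binary fingerprints: Tanimoto similarity = 1 - Jaccard distance
fp_a = np.array([1, 0, 1, 1, 0, 1, 0, 0], dtype=bool)
fp_b = np.array([1, 1, 1, 0, 0, 1, 0, 1], dtype=bool)
print("Tanimoto similarity:", round(1 - jaccard(fp_a, fp_b), 2))
```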

In the context of end point–specific studies, similarity determined by supervised approaches is often impacted by activity cliffs.66–69 Activity cliffs are defined as cases where chemicals with structurally similar features display large differences in potency for a common biological target.70 The successes and failures of the conventional “similar structures, similar properties” paradigm can be attributed to the intricacies encompassing activity cliffs.67,68 Even a minor difference in structure can significantly impact biological activity. Furthermore, employing arbitrarily chosen and unfiltered general descriptors, such as a fixed-length fingerprint, might not be optimal for end point–specific similarity analysis. For instance, using a fingerprint like MACCS along with a similarity index (such as Jaccard–Tanimoto) might consider certain chemicals highly similar, despite their substantial difference in biological activity.21 Therefore, we argue that it is crucial to establish a foundation for quantifying structural similarity and to recognize the implications of this quantification when exploring structure–activity relationships. This entails carefully selecting the descriptors most relevant to the specific end point under investigation.71 This approach selectively encodes the impactful structural features for a particular end point, effectively shaping the supervised similarity and, in turn, minimizing the activity cliff effect.66,72 The concept of activity cliffs can be extended to structural cliffs, wherein structurally dissimilar compounds exhibit similar biological activities.73

When assessing molecular similarity between bioactive substances from a structural bioinformatics perspective, two common representations are employed: 2D structures, which focus on the connectivity of atoms, and 3D structures, which consider the configuration of atoms in three-dimensional space.74 Both 2D and 3D descriptors capture different aspects of molecular properties.29,30,68,75,76 In the case of receptor-targeting drugs, for example, utilizing structural similarity can provide valuable insights.77 By comparing structures of different compounds, researchers can discern patterns that distinguish molecules with binding activity from those without. This is especially important when categorizing agonists, which activate the receptor, and antagonists, which inhibit the receptor activity. Through such analyses, researchers can identify key structural features that are consistently present in binders and absent in nonbinders. This information is invaluable in drug discovery, as it allows for the design of new compounds with optimized binding characteristics, thereby contributing to the development of more effective and targeted pharmaceuticals.3,78

In conclusion, we assert that an investigation of chemical similarity should emphasize its context-dependent nature, raising questions about defining similarity using diverse tools and differentiating end point–agnostic from end point–specific likeness. Defining end point–specific similarity entails selecting the most important features (structural descriptors, physicochemical properties, biological aspects, etc.) that either separately and/or synergistically govern the studied end point.32,79 The applications and interpretations of similarity vary among stakeholders, including medicinal chemists, QSAR model developers, and regulators. Amid these applications, a consensus emerges that similarity definitions should adapt to specific goals and contexts.38 This extends to integrating diverse data, like high-throughput screening results80,81 and molecular initiating events, to bridge gaps and establish similarity. Although end point–specific frameworks are available for end points including phototoxicity, repeated dose toxicity, and reproductive toxicity, experts in the field collectively acknowledge that a universal method for defining end point–specific similarity is lacking.82 Known good practices include broadly considering available data, adopting precise end point definitions, and collaborating across scientific disciplines. This consensus highlights the importance of a more holistic approach to navigate the intricate concept of chemical structure similarity.

Approaches: Supervised and Unsupervised

Molecular similarity, while theoretically customizable by attribute, gains practical utility when considered in the context of a specific problem. Detaching similarity from specific case scenarios renders it less practical and often devoid of significance.54 Thus, customizing similarity to suit a given problem is imperative to confer practical value.3 This involves adopting computational methods to represent molecules before employing supervised and unsupervised techniques tailored to the context.

Supervised and unsupervised learning approaches have shown promise in chemical similarity analysis.5,38,43 The distinction between supervised and unsupervised approaches highlights the different ways in which structural similarity analysis can be performed, depending on the availability of labeled data and the specific goals of the study. End point–specific methods leverage prior knowledge about the end point to guide the analysis, whereas end point–agnostic similarity methods explore the overall structural landscape of chemical compounds in an unbiased manner.

While both supervised and unsupervised approaches can start with the same features and use similar visualization tools, supervised learning tries to predict a specific property, while unsupervised learning explores general similarities without a specific outcome in mind. Unsupervised clustering can help identify important features in each cluster, which can be used for supervised classification. Supervised learning requires applying algorithms for variable selection such as genetic algorithms and simulated annealing.79,83–86 Unsupervised methods can be useful for exploring the structural landscape, identifying outliers, and getting a sense of the data space before applying supervised methods.38,87 Unlike supervised methods where variable selection is applied for end point–specific similarity, for unsupervised methods only dimensionality reduction is necessary since high numbers of descriptors and fingerprints can make the interpretation of results overwhelming.39,88,89

Supervised and unsupervised approaches usually employ univariate or multivariate analysis.90,91 However, multivariate calculations have advantages over univariate ones.92 Univariate approaches consider properties independently, which might overlook complex interactions.93 Multivariate methods, like principal component analysis (PCA) and hierarchical clustering, capture these interactions, yielding more comprehensive insights.90,92,94–98
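
A minimal multivariate, unsupervised sketch combining PCA with hierarchical clustering (scikit-learn and SciPy) is shown below; the descriptor matrix is again a random placeholder.

```python
# Sketch (assumes scikit-learn and SciPy): PCA for dimensionality reduction
# followed by agglomerative (hierarchical) clustering of the scores.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 20))  # placeholder: 60 chemicals x 20 descriptors

X_scaled = StandardScaler().fit_transform(X)
scores = PCA(n_components=2).fit_transform(X_scaled)  # project onto 2 PCs

Z = linkage(scores, method="ward")                 # hierarchical clustering
clusters = fcluster(Z, t=4, criterion="maxclust")  # cut the tree at 4 groups
print("Chemicals per cluster:", np.bincount(clusters)[1:])
```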

Integrating both supervised and unsupervised approaches within the same study is valuable, as they can complement each other effectively.38 When combining different techniques, various synergies might emerge and contribute to a richer understanding of the dataset. In the realm of machine learning, the selection of a particular approach is not unidirectional. Instead, it is a nuanced decision influenced by dataset characteristics and the study objectives. Employing multiple approaches in parallel can yield outcomes that are more interpretable and reliable and enhance the modeling process. When it comes to selecting appropriate analogues and surrogates, context and study goals take center stage. Choosing analogues and surrogates that align with the specific objectives of the study is imperative.82 For example, to identify and justify analogues for systemic toxicity predictions, a combination of metabolic profiling, physicochemical properties, and structural features indicating reactivity potential can be used.99 Robust feature selection techniques play a critical role in ensuring the relevance of the similarity aspect under examination.49,79 In essence, combining supervised and unsupervised techniques enhances the depth of insights acquired. The flexibility to use multiple approaches and the careful selection of analogues, surrogates, and features can uphold the integrity and relevance of the study’s findings, ultimately enhancing the quality of the results.

Considerations in Applying Different Approaches

After understanding the concept of similarity (end point–agnostic and end point–specific) and its uses in supervised and unsupervised learning approaches, we argue that users need to explore the various ways of applying these techniques to conduct clustering and classification studies. We feel that interpreting the outcomes of these approaches, whether used separately or in combination, is key, as is implementing best practices.

While class-based assessment is built upon the assumption that similar chemicals share similar properties, not all similarities have equal influence on the studied end point (biological activity, toxicity, physicochemical property, etc.). For instance, a functional group might not contribute to the chemical reactivity because it is hindered or irrelevant. One of the common pitfalls in carrying out read-across is misapplying unsupervised approaches for end point–specific studies.43,82 Conversely, machine learning leverages comprehensive chemical descriptors for toxicity prediction through supervised learning. By comparing and contrasting these approaches, a pathway emerges for their integration. The goal is to enhance prediction accuracy while retaining the transparency inherent in model interpretation. Integrating the strengths of both methods can yield a powerful predictive model that takes advantage of the unsupervised clustering’s interpretability and the supervised classification’s accuracy, contributing to more effective and reliable applications.38 Figure 4 provides an overview of clustering and classification methods, offering recommendations for their appropriate applications and examples of algorithms commonly employed.

Figure 4.

Figure 4 is a two-column table comparing clustering and classification across eight attributes:

  • Learning type: unsupervised (clustering) vs. supervised (classification).

  • Input data: unlabeled vs. labeled.

  • Variables: explanatory only vs. explanatory and response.

  • Similarity type: general (arbitrary, visual) vs. end point–specific.

  • Grouping algorithms: k-means, k-medoids, and hierarchical clustering vs. k-nearest neighbors, support vector machines, and naive Bayes.

  • Number of groups: unknown (arbitrary) vs. known a priori.

  • Purpose: uncover hidden patterns and relationships within the data, place observations from a dataset into a specific cluster, and create rules to identify associations between variables vs. map the input data to a known, well-defined response, develop models to predict the class of new chemicals, and establish and understand relationships between explanatory and response variables.

  • Illustration: a graphical representation of clusters showing two groups of the same symbol (unlabeled data) vs. a graphical representation of classes showing two groups of different symbols (labeled data).

Explanation of clustering and classification methods, recommendations for their applications, and examples of commonly used algorithms.

We propose that navigating the realm of supervised and unsupervised clustering and classification necessitates adhering to best practices for optimal outcomes. It is crucial to avoid a common pitfall: constructing models (e.g., using random forest or nearest neighbor approaches) without refining the molecular descriptors for the specific dataset.49,100–103 Instead, an iterative supervised feature selection process emerges as a prudent choice. By strategically choosing a subset of descriptors that enhance the performance of a QSAR model within internal cross-validation, the predictive capabilities for new chemicals can be maximized.52,100 Beyond this, establishing best practices for supervised and unsupervised learning in the context of toxicity and physicochemical property datasets is essential.104 Another aspect of this exploration entails assessing how supervised and unsupervised feature selection impact external prediction performance across various QSAR methods like nearest analog, k-nearest neighbors, random forest, and support vector machines. Embracing these practices can improve the clarity and effectiveness of these techniques, enabling insightful analysis and robust predictions.
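
One possible realization of such an iterative, cross-validated feature selection step is sketched below using recursive feature elimination (RFECV in scikit-learn). This is one strategy among many, and the descriptors and end point labels are simulated placeholders.

```python
# Sketch (assumes scikit-learn): supervised feature selection guided by
# internal cross-validation before building the final model.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

rng = np.random.default_rng(3)
X = rng.normal(size=(150, 30))  # placeholder: 150 chemicals x 30 descriptors
y = (X[:, 2] - X[:, 7] + 0.3 * X[:, 11] > 0).astype(int)  # toy end point

selector = RFECV(
    estimator=RandomForestClassifier(n_estimators=100, random_state=0),
    step=2,                      # drop two descriptors per iteration
    cv=5,                        # internal cross-validation guides the choice
    scoring="balanced_accuracy",
)
selector.fit(X, y)
print("Descriptors retained:", selector.n_features_)
print("Selected descriptor indices:", np.where(selector.support_)[0])
```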

More broadly, many themes converge when it comes to best practices for using clustering and classification techniques for toxicity assessment of chemicals. One prominent theme involves establishing the precise objective of a study and subsequently selecting the most suitable approach. Throughout this process, we argue that it is essential to carefully consider factors such as end point type, data availability, and the depth of decision-making required. Additionally, it is pivotal to consider the synergy between supervised and unsupervised methods, emphasizing their combined utility. In our opinion, an essential aspect of this story lies in acknowledging the degree of complexity of the end point under examination and understanding its influence on modeling accuracy. In this confluence of themes, we believe the importance of tailored approaches for robust and accurate toxicity assessment emerges.

Another significant theme is the importance of tailoring fundamental concepts for inexperienced users. Examples of this are considering result accessibility and interpretability, along with the requisite training and knowledge accumulation. Amid this, a challenge emerges in assessing the efficacy and applicability of predictive models.58,105,106 Experts propose a collaborative model development process, wherein developers engage users from the outset.107 This proactive approach ensures that end products align closely with users’ most crucial requirements, rather than presenting solutions after the fact. This theme underscores the value of user-centric design and the shared effort to create impactful and user-friendly tools.

A third notable theme is the significance of employing generic features to predict end points and the necessity of factoring in the quality108,109 and quantity of training datasets.110 This theme centers around input considerations, specifically the criteria guiding method and feature selection. We argue that machine learning’s role in harmonizing data across molecules to facilitate meaningful comparisons is also a crucial aspect within this theme. In essence, this theme underscores the careful deliberation required in determining input parameters, which includes selecting appropriate methods and features and ensuring data quality, ultimately contributing to accurate and reliable results.

In summary, a pivotal takeaway is the paramount significance of collaboration and knowledge exchange between tool developers and users of these approaches. Equally important is the recognition of tailoring machine learning and class-based approaches to the distinct objectives and limitations of each study. We believe that this collective approach ensures that the resulting interpretations are not only accurate and effective but also aligned with the real-world needs and constraints of the users.

Applications and Examples

CCAs find diverse applications in chemical assessment, spanning from prioritization for toxicological research to risk assessment, and are highly adaptable across contexts. A notable instance lies in the European Union’s Registration, Evaluation, Authorization and Restriction of Chemicals (REACH) regulation, where the use of read-across (by showing chemical similarity and a read-across hypothesis) is encouraged for substance registration.111 Here, the integration of both structural and mechanistic similarity is pivotal for successful read-across submissions by REACH registrants.9,112 Delving further into the read-across approaches, the process of analogue selection can be a combination of supervised and unsupervised selection methods. A practical example is the integration of QSAR and read-across.113 The effectiveness of this integration is shaped by the distribution of analogous compounds within the small chemical space around the target.

This structural similarity also proves valuable in the European Chemicals Agency’s (ECHA) grouping of substances for screening and prioritization objectives.111 An exemplar of this functionality is present in the OECD (Organization for Economic Co-operation and Development) QSAR Toolbox (https://qsartoolbox.org/), a software designed for substance grouping, freely accessible through collaboration between ECHA and the OECD. These international bodies have recently published an updated QSAR Assessment Framework, which is intended to support transparent model reporting, verification, validation, and regulatory implementation.114 These practical applications underscore the versatility and relevance of clustering and classification in advancing chemical assessments.

Similarly, an examination of the US Environmental Protection Agency (EPA)’s historical applications and upcoming requirements for prioritization using NAMs reveals their reliance on chemical similarity.115 This involves assessing structural similarity encompassing both two-dimensional (2D) and three-dimensional (3D) aspects, as well as considering bioactivity viewpoints. Such analyses aid in establishing the significance of these approaches in the EPA’s efforts to prioritize chemicals effectively, both in the past and moving forward.116,117

Another illustration involves employing class-based methodologies to assess exposure, hazard, and risk, utilizing subclasses of organohalogen flame retardants (OFRs) as a focal point.118 This comprehensive study encompasses multiple facets of OFR evaluation. It commences with scoping and evidence mapping to categorize existing studies on exposure, mechanisms, and toxicity. Additionally, this investigation delves into exposure, risk, and regulatory implications concerning distinct subclasses of OFRs. This example underscores the versatility of class-based approaches in dissecting multifaceted chemical assessments and regulatory considerations.

Class-based assessments offer a sustainable path forward in different NAMs applications. Effective implementation requires open dialogue among stakeholders, solid case studies, and a focus on education and data management.118 We believe the overarching message is that the collaboration between stakeholders and the continued refinement of class-based methodologies will ultimately shape the future of chemical assessment.

Discussion

Adapting class-based approaches for chemical regulations and hazard/risk evaluations is essential but comes with its set of challenges. Central to this endeavor is understanding the pivotal attributes of chemical structures for targeted similarity assessments. We posit that establishing flexible guidelines tailored to regulatory end points is imperative for navigating this complex landscape effectively. In this section, we delve into the intricacies of adapting class-based methods, focusing on three key takeaways that underscore the importance of robust case studies in evaluating hazards or risks.

How to Choose Between General and End Point–Specific Similarity?

In summary, end point–specific structural similarity is a concept used in cheminformatics and computational chemistry to assess the similarity of chemical compounds based on their structural features with respect to a particular biological or chemical end point. This end point could be the activity against a specific target (e.g., enzyme inhibition, receptor binding), toxicity, metabolic stability, or any other desired property.

The key difference between end point–specific structural similarity and general, end point–agnostic similarity lies in focused optimization vs. a broader consideration of structural characteristics. In end point–specific similarity, the structural features that are considered relevant to the end point of interest are prioritized, while irrelevant features are either minimized or ignored. Conversely, end point–agnostic similarity considers all structural features or general arbitrarily chosen fingerprints without necessarily prioritizing any specific end point.

For example, let’s consider two compounds, Compound A and Compound B, both designed as potential inhibitors of a particular enzyme. End point–specific structural similarity analysis between Compound A and Compound B would primarily focus on the key structural motifs or features known to influence enzyme inhibition. This analysis would disregard other structural aspects that might not be relevant to the inhibition of the enzyme. In contrast, end point–agnostic similarity analysis between Compound A and Compound B would consider all structural features or fingerprints in general, regardless of their relevance to enzyme inhibition. This analysis provides a broader perspective on the overall structural similarity between the two compounds but may not directly address their similarity in terms of enzyme inhibition potency.

While both end point–specific and end point–agnostic general similarity analyses are valuable in cheminformatics and computational chemistry, their applications differ based on the specific goals of the study, with end point–specific similarity being more targeted toward a particular end point or property of interest, and end point–agnostic similarity providing a broader perspective on overall structural similarity. A comprehensive overview of the two types of similarity is provided in Table 1.

Table 1.

Comparative overview of end point–specific and general chemical structure similarity.

Goals of the study

End point–specific similarity:
  • Drug Design: Helps identify compounds with similar structural features to known active compounds for a specific target, facilitating drug discovery.119

  • Toxicity Prediction: Assists in predicting the toxicity of new compounds by comparing their structures to known toxic compounds for a specific end point.109

  • Metabolism Prediction: Aids in predicting the metabolic fate of compounds by comparing their structures to known substrates or inhibitors of specific metabolic enzymes.45

End point–agnostic similarity:

  • Compound Clustering: Allows grouping compounds into clusters based on their overall structural similarity, providing insights into chemical space exploration and diversity analysis.64

  • Virtual Screening: Enables fast screening of large compound libraries to identify potential hits for various biological targets without focusing on specific end points initially.25

  • Chemical Space Exploration: Aids in exploring the chemical space by identifying compounds with diverse structural features, to assist in lead optimization and scaffold hopping.120

Methodologies and tools

End point–specific similarity:
  • Pharmacophore Modeling: Identifies the essential features in a molecule that are crucial for binding to a specific target. Pharmacophore-based similarity assessment compares the pharmacophoric features of compounds rather than their entire structures.121 Tools such as LigandScout, MOE (Molecular Operating Environment), and Discovery Studio provide functionalities for pharmacophore-based similarity analysis.121–123

  • 3D Structure Alignment: For compounds with known 3D structures, structural alignment algorithms can be used to overlay and compare the binding conformations of ligands in the active site of a target protein.124 Methods include maximum common substructure (MCS) alignment or flexible docking. Tools such as AutoDock, GOLD, and Schrödinger Suite offer 3D structure alignment and similarity analysis capabilities.125–127

  • QSAR: QSAR models can be utilized to predict the activity of compounds against a specific target based on their structural descriptors.23 Similarity can be assessed by comparing the predicted activities of compounds using QSAR models.63 Tools like QSAR Toolbox, OPERA, and MOE provide QSAR models based on end point–specific similarity.122,128

End point–agnostic similarity:

  • Fingerprint-based Similarity: Fingerprints encode the structural features of compounds into binary vectors.33 Similarity between compounds is assessed by comparing their fingerprints using metrics such as the Tanimoto coefficient or Euclidean distance.62 Popular fingerprint types include MACCS keys, ECFP (Extended Connectivity Fingerprints), and Morgan fingerprints.34,129 Tools like RDKit, Open Babel, and ChemAxon offer functionalities for fingerprint-based similarity analysis.64,130

  • Graph-based Similarity: Compounds can be represented as graphs, where atoms are nodes and bonds are edges.27 Graph similarity algorithms compare the topology of these graphs to assess structural similarity between compounds.34 Methods include graph kernel methods, graph edit distance, and maximum common subgraph algorithms.60 Tools such as NetworkX, igraph, and MOE provide graph-based similarity analysis capabilities.122

  • Fragment-based Similarity: Compounds are cut into smaller chemical fragments, and similarity is assessed based on the presence or absence of these fragments.131 Fragment-based similarity approaches offer a balance between molecular complexity and computational efficiency.132 Tools providing fragment-based similarity include Fragment Network, RDKit, and ChemAxon.133

Type of learning

End point–specific similarity:
  • Supervised learning: End point–specific similarity approaches apply supervised learning because they are tailored to a specific task (e.g., predicting activity against a particular target) and utilize labeled data or prior knowledge to guide the analysis.87

  • Operating mechanism: The algorithm learns from labeled data, where each example is associated with a target value or class label. Similarly, in end point–specific similarity, the analysis is guided by knowledge of the specific end point or biological activity of interest. The structural features considered in end point–specific similarity are chosen based on their relevance to the end point.63

  • Example methods: Pharmacophore and QSAR modeling explicitly incorporate information about the end point into the analysis. These methods learn to differentiate between compounds based on their similarity to known active compounds or their predicted activity against the target.63,121

End point–agnostic similarity:

  • Unsupervised learning: General similarity approaches apply unsupervised learning, operating in a data-driven manner without relying on labels or knowledge about a specific end point. They aim to uncover underlying relationships in the data without predefined targets or classes.41

  • Operating mechanism: The algorithm autonomously explores the structure of the data without explicit guidance from labeled examples. Similarly, general similarity methods compare compounds based on their overall structural features without specific reference to a particular end point, allowing for a comprehensive exploration of chemical space and identification of underlying patterns.38

  • Example methods: Fingerprint-based similarity and clustering algorithms delve into the inherent structure of the chemical space to discern patterns or similarities among the studied compounds. This analysis facilitates a deeper understanding of the relationships among the chemicals of a dataset.39,64

Note: MACCS, Molecular ACCess System; QSAR, quantitative structure–activity relationships.

Why Can General Similarity Not Be Used for End Point–Specific Applications?

End point–agnostic similarity measures, such as arbitrarily chosen general fingerprint-based or graph-based comparisons, provide a broad assessment of structural similarity between chemical compounds. While they are valuable for tasks like compound clustering, virtual screening, and chemical space exploration, we argue that they are not suitable for end point–specific purposes due to several reasons:

  • Relevance to End Point: End point–agnostic similarity methods do not prioritize the structural features that are specifically relevant to the end point of interest. They consider all structural aspects equally, regardless of their importance for the biological or chemical activity being studied. This can lead to a dilution of the signal relevant to the end point amid a myriad of other structural details.

  • Lack of Specificity: General similarity metrics do not account for the nuanced structural requirements of a particular end point. They may identify compounds as similar even if they differ significantly in their ability to interact with the target or exhibit the desired biological activity. This lack of specificity can result in false positives or false negatives when predicting compound activity.

  • Inability to Capture Pharmacophoric Features: General similarity methods often fail to capture the pharmacophoric features necessary for interaction with a specific target. These features, such as hydrogen bond donors/acceptors, aromatic rings, or hydrophobic regions, are crucial for binding affinity and biological activity. End point–specific similarity approaches, such as pharmacophore modeling, are designed to explicitly consider these features.

  • Risk of Overlooking Important Structural Motifs: End point–specific activities often rely on specific structural motifs or functional groups known to be critical for interaction with the target. End point–agnostic similarity methods may overlook these motifs if they are not prevalent in the dataset or if they are overshadowed by other, more abundant structural features. End point–specific approaches ensure that these important motifs are given appropriate weight in the similarity assessment.

  • Failure to Prioritize Key Interactions: In many cases, the biological activity of a compound is mediated by specific interactions with the target protein or receptor. General similarity measures may fail to prioritize compounds that share these key interactions, leading to inaccurate predictions of compound activity. End point–specific methods, such as 3D structure alignment or QSAR modeling, explicitly consider these interactions.

In summary, while end point–agnostic similarity methods are valuable for exploring overall structural relationships between compounds, we have shown that they are not suitable for end point–specific purposes due to their lack of specificity, inability to capture pharmacophoric features, and risk of overlooking important structural motifs and interactions. End point–specific similarity approaches are designed to address these limitations by focusing explicitly on the structural features relevant to the end point of interest.

How to Identify End Point–Specific Relevant Structural Features?

Identifying end point–specific relevant features for read-across and QSARs involves a systematic approach to selecting and prioritizing structural features that are most influential for the end point of interest. Methods and tools commonly used for this purpose include:

  • Literature Review: Conduct a comprehensive review of the scientific literature to identify known structural features or substructures associated with the end point of interest. This may involve studying published SAR (structure–activity relationship) studies, experimental data, and mechanistic insights.

  • Expert Knowledge: Consult domain experts, such as medicinal chemists, toxicologists, or computational chemists, who possess expertise in the specific end point and can provide valuable insights into the relevant structural features.

  • Data Mining and Cheminformatics Tools: Utilize data mining and cheminformatics tools to analyze large databases of chemical compounds and associated biological activity data. These tools can help identify statistically significant structural patterns or substructures correlated with the end point. Tools such as RDKit, ChemAxon, and KNIME offer functionalities for data mining, substructure searching, and statistical analysis of chemical and biological data.134

  • Pharmacophore Modeling: Construct pharmacophore models based on known active compounds or ligands for the end point of interest. Pharmacophore models identify the essential features (e.g., hydrogen bond donors/acceptors, aromatic rings) required for binding to the target. Tools like LigandScout, MOE, and Discovery Studio provide capabilities for pharmacophore generation and analysis.121–123

  • Fragment-Based Analysis: Decompose chemical compounds into fragments or chemical substructures and analyze their frequency and distribution among active and inactive compounds. Fragments that are overrepresented in active compounds may represent relevant structural features. Tools like Fragment Network and RDKit provide functionalities for fragment-based analysis and substructure mining.133

  • Machine Learning and Feature Importance Analysis: Employ machine learning algorithms, such as random forests or gradient boosting machines, to rank the importance of structural features in predicting the end point. Feature importance analysis techniques can identify the most influential descriptors or substructures. Machine learning libraries like scikit-learn (Sci-kit Learn Developers) in Python offer feature importance analysis methods for various algorithms.
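
As a sketch of the machine learning bullet above, the following code ranks hypothetical descriptors by a random forest's impurity-based importances using scikit-learn; the data and descriptor names are placeholders, not part of the original article.

```python
# Sketch (assumes scikit-learn): impurity-based feature importance ranking
# with a random forest on placeholder descriptors and a toy end point.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(4)
descriptor_names = [f"descriptor_{i}" for i in range(10)]  # hypothetical names
X = rng.normal(size=(120, 10))
y = (X[:, 1] + X[:, 5] > 0).astype(int)  # toy end point

model = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)
ranking = sorted(zip(descriptor_names, model.feature_importances_),
                 key=lambda pair: pair[1], reverse=True)
for name, importance in ranking[:5]:
    print(f"{name}: {importance:.3f}")
```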

By integrating these methods and tools, researchers can systematically identify and prioritize end point–specific relevant features for class-based studies, facilitating the development of predictive models and decision-making in chemical risk assessment and compound optimization.

Conclusions

This overview of class-based approaches outlined key concepts for supervised and unsupervised similarity methods and explored current applications of clustering and classification approaches. It addressed challenges and proposed solutions, along with advocating for future collaborations to promote better understanding, best practices, and increased use of CCAs in toxicity research and risk assessment. This serves two main purposes: educating users about the diverse CCAs and fostering enduring collaborations to tackle challenges associated with end point–specific similarity. The initial objective involved providing an in-depth overview of CCAs, emphasizing their relevance in scientific toxicity research and hazard identification—a pivotal step in risk assessment. Emphasis was placed on understanding the fundamental principles behind these techniques to ensure correct tool utilization and precise result interpretation.

The secondary objective was to emphasize the context-driven nature of similarity. Distinctions were drawn between end point–agnostic similarity, which is crucial for unsupervised analysis, and end point–specific similarity, which requires supervised selection of pertinent variables for accurate end point specificity. While combining supervised and unsupervised approaches is possible, it is crucial to exercise caution regarding the potential misuse of unsupervised analyses, especially when aiming for end point specificity. In the initial stages of selecting analogues for read-across, it is imperative to avoid using unsupervised methods based on general features denoting end point–agnostic similarity, such as starting with an arbitrarily selected fingerprint with a similarity metric. Analogue selection based on general (visual or arbitrary) similarity does not inherently guarantee end point specificity, potentially leading to erroneous predictions.5 Analogues are also nontransferable across different studies because each end point has its own most influential features that define its underlying specific similarity.

This work inspired the creation of an automated workflow aimed at facilitating chemical grouping using supervised and unsupervised methods.10 The workflow was developed within the free and open-source KNIME environment.134 It implements the guidance and key takeaways discussed herein and features an intuitive, documented, and guided graphical user interface intended to make CCAs accessible to nonexperts and to simplify their application for the wider community.

For broader application of CCAs, we believe it is important to establish a sustained international consortium of collaborators. The consortium’s role would involve leveraging expertise, developing roadmaps and standards, establishing effective communication and result-sharing mechanisms, publishing state-of-the-science papers, and investing in technological advancements. Through interdisciplinary collaboration and implementation of these recommendations, we can enhance the utilization of CCAs across diverse fields.

Acknowledgments

The National Institute of Environmental Health Sciences (NIEHS) hosted a Clustering and Classification workshop (https://www.niehs.nih.gov/news/events/pastmtg/2022/nams2022/index.cfm). General information from this workshop informed the introduction and overview of the topic provided in this manuscript. We thank the workshop speakers and discussants for sharing their insights.

We thank the NIEHS and the Division of Translational Toxicology (DTT) for their support, particularly the DTT’s Consumer Products and Therapeutics Program Management Team. This work was supported by the Intramural Research Program of the NIH, National Institute of Environmental Health Sciences, Intramural Research project ZIA ES103379.

Conclusions and opinions are those of the individual authors and do not necessarily reflect the policies or views of EHP Publishing or the National Institute of Environmental Health Sciences.

References

  • 1.Cronin MTD. 2013. An introduction to chemical grouping, categories and read-across to predict toxicity. In: Chemical Toxicity Prediction. Cronin M, Madden J, Enoch S, Roberts D, eds. London, UK: RSC Publishing, 1–29. [Google Scholar]
  • 2.Zhu H, Bouhifd M, Donley E, Egnash L, Kleinstreuer N, Kroese ED, et al. 2016. Supporting read-across using biological data. ALTEX 33(2):167–182, PMID: 26863516, 10.14573/altex.1601252. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Medina-Franco JL, Maggiora GM. 2013. Molecular similarity analysis. In: Chemoinformatics for Drug Discovery. Bajorath J, ed. Hoboken, NJ: John Wiley & Sons, Ltd, 343–399. [Google Scholar]
  • 4.Ballabio D, Cosio MS, Mannino S, Todeschini R. 2006. A chemometric approach based on a novel similarity/diversity measure for the characterisation and selection of electronic nose sensors. Anal Chim Acta 578(2):170–177, PMID: 17723709, 10.1016/j.aca.2006.06.067. [DOI] [PubMed] [Google Scholar]
  • 5.Ballabio D, Vasighi M, Filzmoser P. 2013. Effects of supervised self organising maps parameters on classification performance. Anal Chim Acta 765:45–53, PMID: 23410625, 10.1016/j.aca.2012.12.027. [DOI] [PubMed] [Google Scholar]
  • 6.Kuzmanovski I, Trpkovska M, Šoptrajanov B. 2005. Optimization of supervised self-organizing maps with genetic algorithms for classification of urinary calculi. J Mol Struct 744–747:833–838, 10.1016/j.molstruc.2005.01.059. [DOI] [Google Scholar]
  • 7.CPSC (US Consumer Product Safety Commission). 2014. Chronic Hazard Advisory Panel (CHAP) on Phthalates. Bethesda, MD: CPSC. https://www.cpsc.gov/chap [accessed 28 February 2024]. [Google Scholar]
  • 8.Maffini MV, Rayasam SDG, Axelrad DA, Birnbaum LS, Cooper C, Franjevic S, et al. 2023. Advancing the science on chemical classes. Environ Health 21(suppl 1):120, PMID: 36635752, 10.1186/s12940-022-00919-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.OECD (Organization for Economic Co-operation and Development). 2014. Guidance on Grouping of Chemicals, Second Edition. https://www.oecd.org/publications/guidance-on-grouping-of-chemicals-second-edition-9789264274679-en.htm [accessed 28 February 2024].
  • 10.Moreira-Filho JT. 2023. Chemical grouping workflow. https://github.com/NIEHS/Chemical-grouping-workflow [accessed 13 June 2024].
  • 11.Moreira-Filho JT. 2024. Chemical grouping workflow 0.7.0. KNIME Community Hub. https://hub.knime.com/-/spaces/-/∼AnmyNgAW4JMJ_gq4/ [accessed 13 June 2024]. [Google Scholar]
  • 12.Andersen ME, McMullen PD, Phillips MB, Yoon M, Pendse SN, Clewell HJ, et al. 2019. Developing context appropriate toxicity testing approaches using new alternative methods (NAMs). ALTEX 36(4):523–534, PMID: 31664457, 10.14573/altex.1906261. [DOI] [PubMed] [Google Scholar]
  • 13.Interagency Coordinating Committee on the Validation of Alternative Methods. 2018. A Strategic Roadmap for Establishing New Approaches to Evaluate the Safety of Chemicals and Medical Products in the United States. Research Triangle Park, NC: National Toxicology Program. 10.22427/NTP-ICCVAM-ROADMAP2018. [DOI] [Google Scholar]
  • 14.ECETOC (European Centre for Ecotoxicology and Toxicology of Chemicals). 2012. Category Approaches, Read-across, (Q)SAR. Brussels, Belgium: ECETOC. 10.1016/B978-0-12-386454-3.00505-4. [DOI] [Google Scholar]
  • 15.ECHA (European Chemicals Agency). 2011. The Use of Alternatives to Testing on Animals for the REACH Regulation 2011. Helsinki, Finland: ECHA. [Google Scholar]
  • 16.ECHA. 2012. Practical Guide 6. How to Report Read-across and Categories. Helsinki, Finland: ECHA. [Google Scholar]
  • 17.OECD. 2007. Guidance Document on the Validation of (Quantitative) Structure-Activity Relationship [(Q)SAR] Models. Paris, France: OECD. [Google Scholar]
  • 18.OECD. 2011. Report of the Workshop on Using Mechanistic Information in Forming Chemical Categories. Environment Health and Safety Publications, Series on Testing and Assessment No. 138, Report No. ENV/JM/MONO(2011) 8, JT03301985. Paris, France: OECD. [Google Scholar]
  • 19.Low Y, Sedykh A, Fourches D, Golbraikh A, Whelan M, Rusyn I, et al. 2013. Integrative chemical–biological read-across approach for chemical hazard classification. Chem Res Toxicol 26(8):1199–1208, PMID: 23848138, 10.1021/tx400110f. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Schultz TW, Amcoff P, Berggren E, Gautier F, Klaric M, Knight DJ, et al. 2015. A strategy for structuring and reporting a read-across prediction of toxicity. Regul Toxicol Pharmacol 72(3):586–601, PMID: 26003513, 10.1016/j.yrtph.2015.05.016. [DOI] [PubMed] [Google Scholar]
  • 21.Varnek A. 2020. Similarity and Diversity, University of Strasbourg, France. PowerPoint Presentation - ID:9526169. SlideServe. https://www.slideserve.com/armistead/similarity-and-diversity-alexandre-varnek-university-of-strasbourg-france-powerpoint-ppt-presentation [accessed 6 March 2024]. [Google Scholar]
  • 22.Arteca GA, Jammal VB, Mezey PG, Mezey PG. 1988. Shape group studies of molecular similarity and regioselectivity in chemical reactions. J Comput Chem 9(6):608–619, 10.1002/jcc.540090606. [DOI] [Google Scholar]
  • 23.Balaban AT. 1998. Topological and stereochemical molecular descriptors for databases useful in QSAR, similarity/dissimilarity and drug design. SAR QSAR Environ Res 8:1–21. [Google Scholar]
  • 24.Hu Y, Bajorath J. 2010. Exploring target-selectivity patterns of molecular scaffolds. ACS Med Chem Lett 1(2):54–58, PMID: 24900176, 10.1021/ml900024v. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Krier M, Bret G, Rognan D. 2006. Assessing the scaffold diversity of screening libraries. J Chem Inf Model 46(2):512–524, PMID: 16562979, 10.1021/ci050352v. [DOI] [PubMed] [Google Scholar]
  • 26.Schneider G, Schneider P, Renner S. 2006. Scaffold-hopping: how far can you jump? QSAR Comb Sci 25(12):1162–1171, 10.1002/qsar.200610091. [DOI] [Google Scholar]
  • 27.Balaban AT. 1978. Chemical graphs. XXXII. Constitutional and steric isomers of substituted cycloalkanes. Croat Chem Acta 51:35–42. [Google Scholar]
  • 28.Todeschini R, Consonni V. 2009. Molecular Descriptors for Chemoinformatics. Weinheim, Germany: Wiley-VCH. [Google Scholar]
  • 29.Bahia MS, Kaspi O, Touitou M, Binayev I, Dhail S, Spiegel J, et al. 2023. A comparison between 2D and 3D descriptors in QSAR modeling based on bio‐active conformations. Mol Inform 42(4):e2200186, PMID: 36617991, 10.1002/minf.202200186. [DOI] [PubMed] [Google Scholar]
  • 30.Da Silva Veras L, Arakawa M, Funatsu K, Takahata Y. 2010. 2D and 3D QSAR studies of the receptor binding affinity of progestins. J Braz Chem Soc 21(5):872–881, 10.1590/S0103-50532010000500015. [DOI] [Google Scholar]
  • 31.ChemIntelligence. 2021. Machine learning descriptors for molecules. ChemIntelligence Blog. https://chemintelligence.com/blog/machine-learning-descriptors-molecules [accessed 8 March 2024]. [Google Scholar]
  • 32.Grisoni F, Consonni V, Todeschini R. 2018. Impact of molecular descriptors on computational models. Methods Mol Biol 1825:171–209, 10.1007/978-1-4939-8639-2_5. [DOI] [PubMed] [Google Scholar]
  • 33.Willett P. 2006. Similarity-based virtual screening using 2D fingerprints. Drug Discov Today 11(23–24):1046–1053, PMID: 17129822, 10.1016/j.drudis.2006.10.005. [DOI] [PubMed] [Google Scholar]
  • 34.Anderson S. 1984. Graphical representation of molecules and substructure-search queries in MACCS. J Mol Graph 2(3):83–90, 10.1016/0263-7855(84)80060-0. [DOI] [Google Scholar]
  • 35.Bajusz D, Rácz A, Héberger K. 2017. Chemical data formats, fingerprints, and other molecular descriptions for database analysis and searching. In: Comprehensive Medicinal Chemistry III. Amsterdam, The Netherlands: Elsevier, 329–378. [Google Scholar]
  • 36.Raymond JW, Willett P. 2002. Effectiveness of graph-based and fingerprint-based similarity measures for virtual screening of 2D chemical structure databases. J Comput Aided Mol Des 16(1):59–71, PMID: 12197666, 10.1023/a:1016387816342. [DOI] [PubMed] [Google Scholar]
  • 37.Willett P, Barnard JM, Downs GM. 1998. Chemical similarity searching. J Chem Inf Comput Sci 38(6):983–996, 10.1021/ci9800211. [DOI] [Google Scholar]
  • 38.Tetko IV, Villa AEP. 1995. Unsupervised and supervised learning: cooperation toward a common goal. In: International Conference on Artificial Neural Networks NEURONIMES95. 9–13 October 1995. Paris, France: EC2 & Cie, 105–110. [Google Scholar]
  • 39.Whitley DC, Ford MG, Livingstone DJ. 2000. Unsupervised forward selection: a method for eliminating redundant variables. J Chem Inf Comput Sci 40(5):1160–1168, PMID: 11045809, 10.1021/ci000384c. [DOI] [PubMed] [Google Scholar]
  • 40.Stahl M, Mauser H. 2005. Database clustering with a combination of fingerprint and maximum common substructure methods. J Chem Inf Model 45(3):542–548, PMID: 15921444, 10.1021/ci050011h. [DOI] [PubMed] [Google Scholar]
  • 41.Piraino P, Ricciardi A, Salzano G, Zotta T, Parente E. 2006. Use of unsupervised and supervised artificial neural networks for the identification of lactic acid bacteria on the basis of SDS-PAGE patterns of whole cell proteins. J Microbiol Methods 66(2):336–346, PMID: 16480784, 10.1016/j.mimet.2005.12.007. [DOI] [PubMed] [Google Scholar]
  • 42.Berrueta LA, Alonso-Salces RM, Héberger K. 2007. Supervised pattern recognition in food analysis. J Chromatogr A 1158(1–2):196–214, PMID: 17540392, 10.1016/j.chroma.2007.05.024. [DOI] [PubMed] [Google Scholar]
  • 43.González AG. 2007. Use and misuse of supervised pattern recognition methods for interpreting compositional data. J Chromatogr A 1158(1–2):215–225, PMID: 17350028, 10.1016/j.chroma.2007.02.091. [DOI] [PubMed] [Google Scholar]
  • 44.Mazzatorta P, Vračko M, Jezierska A, Benfenati E. 2003. Modeling toxicity by using supervised kohonen neural networks. J Chem Inf Comput Sci 43(2):485–492, PMID: 12653512, 10.1021/ci0256182. [DOI] [PubMed] [Google Scholar]
  • 45.Broccatelli F, Mannhold R, Moriconi A, Giuli S, Carosati E. 2012. QSAR modeling and data mining link torsades de pointes risk to the interplay of extent of metabolism, active transport, and hERG liability. Mol Pharm 9(8):2290–2301, PMID: 22742658, 10.1021/mp300156r. [DOI] [PubMed] [Google Scholar]
  • 46.Balaban AT, Kier LB, Joshi N. 1992. Correlations between chemical structure and normal boiling points of acyclic ethers, peroxides, acetals, and their sulfur analogues. J Chem Inf Comput Sci 32(3):237–244, 10.1021/ci00007a011. [DOI] [Google Scholar]
  • 47.Ashenhurst J. 2010. 3 Trends That Affect Boiling Points. Master Organic Chemistry. https://www.masterorganicchemistry.com/2010/10/25/3-trends-that-affect-boiling-points/ [accessed 29 February 2024]. [Google Scholar]
  • 48.Liu S. 2014. Where does the electron go? The nature of ortho/para and meta group directing in electrophilic aromatic substitution. J Chem Phys 141(19):194109, PMID: 25416876, 10.1063/1.4901898. [DOI] [PubMed] [Google Scholar]
  • 49.Agrafiotis DK, Cedeño W. 2002. Feature selection for structure-activity correlation using binary particle swarms. J Med Chem 45(5):1098–1107, PMID: 11855990, 10.1021/jm0104668. [DOI] [PubMed] [Google Scholar]
  • 50.Baumann K. 2005. Chance correlation in variable subset regression: influence of the objective function, the selection mechanism, and ensemble averaging. QSAR Comb Sci 24(9):1033–1046, 10.1002/qsar.200530134. [DOI] [Google Scholar]
  • 51.Rovida C, Escher SE, Herzler M, Bennekou SH, Kamp H, Kroese DE, et al. 2021. NAM-supported read-across: from case studies to regulatory guidance in safety assessment. ALTEX 38(1):140–150, PMID: 33452529, 10.14573/altex.2010062. [DOI] [PubMed] [Google Scholar]
  • 52.Consonni V, Ballabio D, Todeschini R. 2009. Comments on the definition of the Q2 parameter for QSAR validation. J Chem Inf Model 49(7):1669–1678, PMID: 19527034, 10.1021/ci900115y. [DOI] [PubMed] [Google Scholar]
  • 53.Varmuza K, Filzmoser P, Dehmer M. 2013. Multivariate linear QSPR/QSAR models: rigorous evaluation of variable selection for PLS. Comput Struct Biotechnol J 5:e201302007, PMID: 24688700, 10.5936/csbj.201302007. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 54.Bender A, Jenkins JL, Scheiber J, Sukuru SCK, Glick M, Davies JW. 2009. How similar are similarity searching methods? A principal component analysis of molecular descriptor space. J Chem Inf Model 49(1):108–119, PMID: 19123924, 10.1021/ci800249s. [DOI] [PubMed] [Google Scholar]
  • 55.Atwood ST, Lunn RM, Garner SC, Jahnke GD. 2019. New perspectives for cancer hazard evaluation by the report on carcinogens: a case study using read-across methods in the evaluation of haloacetic acids found as water disinfection by-products. Environ Health Perspect 127(12):125003, PMID: 31854200, 10.1289/EHP5672. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 56.Cash GG. 1999. Prediction of physicochemical properties from Euclidean distance methods based on electrotopological state indices. Chemosphere 39(14):2583–2591, 10.1016/S0045-6535(99)00158-7. [DOI] [Google Scholar]
  • 57.Mahalanobis PC. 1936. On the Generalized Distance in Statistics. Calcutta, India: National Institute of Sciences of India, 49–55. [Google Scholar]
  • 58.Sahigara F, Mansouri K, Ballabio D, Mauri A, Consonni V, Todeschini R. 2012. Comparison of different approaches to define the applicability domain of QSAR models. Molecules 17(5):4791–4810, PMID: 22534664, 10.3390/molecules17054791. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 59.Todeschini R, Consonni V, Pavan M. 2004. A distance measure between models: a tool for similarity/diversity analysis of model populations. Chemom Intell Lab Syst 70(1):55–61, 10.1016/j.chemolab.2003.10.003. [DOI] [Google Scholar]
  • 60.Basak SC, Magnuson VR, Niemi GJ, Regal RR. 1988. Determining structural similarity of chemicals using graph-theoretic indices. Disc Appl Math 19(1–3):17–44, 10.1016/0166-218X(88)90004-2. [DOI] [Google Scholar]
  • 61.Fechner U, Schneider G. 2004. Evaluation of distance metrics for ligand-based similarity searching. ChemBioChem 5(4):538–540, PMID: 15185379, 10.1002/cbic.200300812. [DOI] [PubMed] [Google Scholar]
  • 62.Fligner MA, Verducci JS, Blower PE. 2002. A modification of the Jaccard–Tanimoto similarity index for diverse selection of chemical compounds using binary strings. Technometrics 44(2):110–119, 10.1198/004017002317375064. [DOI] [Google Scholar]
  • 63.Glen RC, Adams SE. 2006. Similarity metrics and descriptor spaces: which combinations to choose? QSAR Comb Sci 25(12):1133–1142, 10.1002/qsar.200610097. [DOI] [Google Scholar]
  • 64.Raymond JW, Blankley CJ, Willett P. 2003. Comparison of chemical clustering methods using graph- and fingerprint-based similarity measures. J Mol Graph Model 21(5):421–433, PMID: 12543138, 10.1016/s1093-3263(02)00188-2. [DOI] [PubMed] [Google Scholar]
  • 65.Todeschini R, Consonni V, Xiang H, Holliday J, Buscema M, Willett P. 2012. Similarity coefficients for binary chemoinformatics data: overview and extended comparison using simulated and real data sets. J Chem Inf Model 52(11):2884–2901, PMID: 23078167, 10.1021/ci300261r. [DOI] [PubMed] [Google Scholar]
  • 66.Guha R, Van Drie JH. 2008. Structure–activity landscape index: identifying and quantifying activity cliffs. J Chem Inf Model 48(3):646–658, PMID: 18303878, 10.1021/ci7004093. [DOI] [PubMed] [Google Scholar]
  • 67.Maggiora GM. 2006. On outliers and activity cliffs—why QSAR often disappoints. J Chem Inf Model 46(4):1535, PMID: 16859285, 10.1021/ci060117s. [DOI] [PubMed] [Google Scholar]
  • 68.Medina-Franco JL, Martínez-Mayorga K, Bender A, Marín RM, Giulianotti MA, Pinilla C, et al. 2009. Characterization of activity landscapes using 2D and 3D similarity methods: consensus activity cliffs. J Chem Inf Model 49(2):477–491, PMID: 19434846, 10.1021/ci800379q. [DOI] [PubMed] [Google Scholar]
  • 69.van Tilborg D, Alenicheva A, Grisoni F. 2022. Exposing the limitations of molecular machine learning with activity cliffs. J Chem Inf Model 62(23):5938–5951, PMID: 36456532, 10.1021/acs.jcim.2c01073. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 70.Stumpfe D, Hu H, Bajorath J. 2019. Evolving concept of activity cliffs. ACS Omega 4(11):14360–14368, PMID: 31528788, 10.1021/acsomega.9b02221. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 71.Tamura S, Miyao T, Bajorath J. 2023. Large-scale prediction of activity cliffs using machine and deep learning methods of increasing complexity. J Cheminform 15(1):4, PMID: 36611204, 10.1186/s13321-022-00676-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 72.Medina-Franco JL. 2012. Scanning structure-activity relationships with structure-activity similarity and related maps: from consensus activity cliffs to selectivity switches. J Chem Inf Model 52(10):2485–2493, PMID: 22989212, 10.1021/ci300362x. [DOI] [PubMed] [Google Scholar]
  • 73.Iyer P, Stumpfe D, Vogt M, Bajorath J, Maggiora GM. 2013. Activity landscapes, information theory, and structure - activity relationships. Mol Inform 32(5–6):421–430, PMID: 27481663, 10.1002/minf.201200120. [DOI] [PubMed] [Google Scholar]
  • 74.Kumar A, Zhang KYJ. 2018. Advances in the development of shape similarity methods and their application in drug discovery. Front Chem 6:315, PMID: 30090808, 10.3389/fchem.2018.00315. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 75.Brown RD, Martin YC. 1997. The information content of 2D and 3D structural descriptors relevant to ligand–receptor binding. J Chem Inf Comput Sci 37(1):1–9, 10.1021/ci960373c. [DOI] [Google Scholar]
  • 76.Dubot JP. 1993. 2D and 3D lipophilicity parameters in QSAR. In: Trends in QSAR and Molecular Modelling 92. Leiden, The Netherlands: ESCOM, 93–100. [Google Scholar]
  • 77.Chiesa L, Sick E, Kellenberger E. 2023. Predicting the duration of action of β2-adrenergic receptor agonists: ligand and structure-based approaches. Mol Inform 42(12):e202300141, PMID: 37872120, 10.1002/minf.202300141. [DOI] [PubMed] [Google Scholar]
  • 78.Kellenberger E, Rodrigo J, Muller P, Rognan D. 2004. Comparative evaluation of eight docking tools for docking and virtual screening accuracy. Proteins Struct Funct Genet 57(2):225–242, PMID: 15340911, 10.1002/prot.20149. [DOI] [PubMed] [Google Scholar]
  • 79.Sutter J. 1995. Selection of molecular descriptors for quantitative structure–activity relationships. In: Adaption of Simulated Annealing to Chemical Optimization Problems. Amsterdam, The Netherlands: Elsevier, 111–132. [Google Scholar]
  • 80.Kleinstreuer NC, Dix DJ, Houck KA, Kavlock RJ, Knudsen TB, Martin MT, et al. 2013. In vitro perturbations of targets in cancer hallmark processes predict rodent chemical carcinogenesis. Toxicol Sci 131(1):40–55, PMID: 23024176, 10.1093/toxsci/kfs285. [DOI] [PubMed] [Google Scholar]
  • 81.Reif DM, Martin MT, Tan SW, Houck KA, Judson RS, Richard AM, et al. 2010. Endocrine profiling and prioritization of environmental chemicals using ToxCast data. Environ Health Perspect 118(12):1714–1720, PMID: 20826373, 10.1289/ehp.1002180. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 82.Moustakas H, Date MS, Kumar M, Schultz TW, Liebler DC, Penning TM, et al. 2022. An end point-specific framework for read-across analog selection for human health effects. Chem Res Toxicol 35(12):2324–2334, PMID: 36458907, 10.1021/acs.chemrestox.2c00286. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 83.Ballabio D, Vasighi M, Consonni V, Kompany-Zareh M. 2011. Genetic algorithms for architecture optimisation of counter-propagation artificial neural networks. Chemom Intell Lab Syst 105(1):56–64, 10.1016/j.chemolab.2010.10.010. [DOI] [Google Scholar]
  • 84.Blower PE, Fligner MA, Verducci JS, Bjoraker J. 2002. On combining recursive partitioning and simulated annealing to detect groups of biologically active compounds. J Chem Inf Comput Sci 42(2):393–404, PMID: 11911709, 10.1021/ci0101049. [DOI] [PubMed] [Google Scholar]
  • 85.Sun L, Xie Y, Song X, Wang J, Yu R. 1994. Cluster analysis by simulated annealing. Comput Chem 18(2):103–108, 10.1016/0097-8485(94)85003-8. [DOI] [Google Scholar]
  • 86.Leardi R. 2001. Genetic algorithms in chemometrics and chemistry: a review. J Chemom 15(7):559–569, 10.1002/cem.651. [DOI] [Google Scholar]
  • 87.Wongravee K, Lloyd GR, Silwood CJ, Grootveld M, Brereton RG. 2010. Supervised self organizing maps for classification and determination of potentially discriminatory variables: illustrated by application to nuclear magnetic resonance metabolomic profiling. Anal Chem 82(2):628–638, PMID: 20038089, 10.1021/ac9020566. [DOI] [PubMed] [Google Scholar]
  • 88.Reutlinger M, Schneider G. 2012. Nonlinear dimensionality reduction and mapping of compound libraries for drug discovery. J Mol Graph Model 34:108–117, PMID: 22326864, 10.1016/j.jmgm.2011.12.006. [DOI] [PubMed] [Google Scholar]
  • 89.Yamamoto H, Yamaji H, Abe Y, Harada K, Waluyo D, Fukusaki E, et al. 2009. Dimensionality reduction for metabolome data using PCA, PLS, OPLS, and RFDA with differential penalties to latent variables. Chemom Intell Lab Syst 98(2):136–142, 10.1016/j.chemolab.2009.05.006. [DOI] [Google Scholar]
  • 90.Ballabio D, Todeschini R. 2009. Multivariate classification for qualitative analysis. In: Infrared Spectroscopy for Food Quality Analysis and Control. Sun D-W, ed. Burlington, MA: Academic Press, 83–104. [Google Scholar]
  • 91.Jouan-Rimbaud D, Bouveresse E, Massart DL, De Noord OE. 1999. Detection of prediction outliers and inliers in multivariate calibration. Anal Chim Acta 388(3):283–301, 10.1016/S0003-2670(98)00626-6. [DOI] [Google Scholar]
  • 92.Ballabio D, Grisoni F, Todeschini R. 2018. Multivariate comparison of classification performance measures. Chemom Intell Lab Syst 174:33–44, 10.1016/j.chemolab.2017.12.004. [DOI] [Google Scholar]
  • 93.Alessandri S, Cimato A, Modi G, Crescenzi A, Caselli S, Tracchi S, et al. 1997. Univariate models to classify Tuscan virgin olive oils by zone. Riv Ital Sostanze Grasse 74:155–163. [Google Scholar]
  • 94.Ballabio D, Consonni V, Costa F. 2012. Relationships between apple texture and rheological parameters by means of multivariate analysis. Chemom Intell Lab Syst 111(1):28–33, 10.1016/j.chemolab.2011.11.002. [DOI] [Google Scholar]
  • 95.Brown PJ. 1982. Multivariate calibration. J Roy Stat Soc Ser B 44(3):287–308, 10.1111/j.2517-6161.1982.tb01209.x. [DOI] [Google Scholar]
  • 96.Johnson RA, Wichern DW. 1992. Applied Multivariate Analysis. Englewood Cliffs, NJ: Prentice-Hall. [Google Scholar]
  • 97.Krzanowski WJ. 1988. Principles of Multivariate Analysis. Oxford, UK: Oxford University Press. [Google Scholar]
  • 98.Mardia KV. 1988. Multivariate Analysis. London, UK: Academic Press. [Google Scholar]
  • 99.Lester C, Byrd E, Shobair M, Yan G. 2023. Quantifying analogue suitability for SAR-Based read-across toxicological assessment. Chem Res Toxicol 36(2):230–242, PMID: 36701522, 10.1021/acs.chemrestox.2c00311. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 100.Ballabio D, Consonni V, Mauri A, Todeschini R. 2010. Canonical measure of correlation (CMC) and canonical measure of distance (CMD) between sets of data. Part 3. Variable selection in classification. Anal Chim Acta 657(2):116–122, PMID: 20005322, 10.1016/j.aca.2009.10.033. [DOI] [PubMed] [Google Scholar]
  • 101.De Lucia FC Jr., Gottfried JL. 2011. Influence of variable selection on partial least squares discriminant analysis models for explosive residue classification. Spectrochim Acta Part B At Spectrosc 66(2):122–128, 10.1016/j.sab.2010.12.007. [DOI] [Google Scholar]
  • 102.Eklund M, Norinder U, Boyer S, Carlsson L. 2014. Choosing feature selection and learning algorithms in QSAR. J Chem Inf Model 54(3):837–843, PMID: 24460242, 10.1021/ci400573c. [DOI] [PubMed] [Google Scholar]
  • 103.Leardi R, Boggia R, Terrile M. 1992. Genetic algorithms as a strategy for feature selection. J Chemom 6(5):267–281, 10.1002/cem.1180060506. [DOI] [Google Scholar]
  • 104.Bender A, Schneider N, Segler M, Patrick Walters W, Engkvist O, Rodrigues T. 2022. Evaluation guidelines for machine learning tools in the chemical sciences. Nat Rev Chem 6(6):428–442, PMID: 37117429, 10.1038/s41570-022-00391-9. [DOI] [PubMed] [Google Scholar]
  • 105.Jaworska J, Nikolova-Jeliazkova N, Aldenberg T. 2005. QSAR applicability domain estimation by projection of the training set in descriptor space: a review. Altern Lab Anim 33(5):445–459, PMID: 16268757, 10.1177/026119290503300508. [DOI] [PubMed] [Google Scholar]
  • 106.Sushko I, Novotarskyi S, Körner R, Pandey AK, Cherkasov A, Li J, et al. 2010. Applicability domains for classification problems: benchmarking of distance to models for Ames mutagenicity set. J Chem Inf Model 50(12):2094–2111, PMID: 21033656, 10.1021/ci100253r. [DOI] [PubMed] [Google Scholar]
  • 107.Margaria T, Steffen B. 2008. Agile IT: thinking in user-centric models. In: Leveraging Applications of Formal Methods, Verification and Validation, vol. 17. Margaria T, Steffen B, eds. Berlin, Germany: Springer; 490–502. [Google Scholar]
  • 108.Fourches D, Muratov E, Tropsha A. 2010. Trust, but verify: on the importance of chemical structure curation in cheminformatics and QSAR modeling research. J Chem Inf Model 50(7):1189–1204, PMID: 20572635, 10.1021/ci100176x. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 109.Helma C, Kramer S, Pfahringer B, Gottmann E. 2000. Data quality in predictive toxicology: identification of chemical structures and calculation of chemical properties. Environ Health Perspect 108(11):1029–1033, PMID: 11102292, 10.1289/ehp.001081029. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 110.Richarz AN. 2019. Challenges, opportunities and perspectives. In: Big Data in Predictive Toxicology. Neagu D, Richarz A-N, eds. London, UK: RSC Publishing, 1–37. [Google Scholar]
  • 111.ECHA. 2017. Read-Across Assessment Framework (RAAF). Helsinki, Finland: ECHA. https://data.europa.eu/doi/10.2823/619212 [accessed 29 February 2024]. [Google Scholar]
  • 112.Nikolova N, Jaworska J. 2003. Approaches to measure chemical similarity–a review. QSAR Comb Sci 22(9–10):1006–1026, 10.1002/qsar.200330831. [DOI] [Google Scholar]
  • 113.Benfenati E, Roncaglioni A, Petoumenou MI, Cappelli CI, Gini G. 2015. Integrating QSAR and read-across for environmental assessment. SAR QSAR Environ Res 26(7–9):605–618, PMID: 26535447, 10.1080/1062936X.2015.1078408. [DOI] [PubMed] [Google Scholar]
  • 114.OECD. 2023. (Q)SAR Assessment Framework: Guidance for the Regulatory Assessment of (Quantitative) Structure - Activity Relationship Models. Paris, France: OECD. [Google Scholar]
  • 115.EPA (US Environmental Protection Agency). 2020. New Approach Methods Work Plan. https://www.epa.gov/chemical-research/new-approach-methods-work-plan [accessed 14 October 2020].
  • 116.Lizarraga LE, Suter GW, Lambert JC, Patlewicz G, Zhao JQ, Dean JL, et al. 2023. Advancing the science of a read-across framework for evaluation of data-poor chemicals incorporating systematic and new approach methods. Regul Toxicol Pharmacol 137:105293, PMID: 36414101, 10.1016/j.yrtph.2022.105293. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 117.Patlewicz G, Shah I. 2023. Towards systematic read-across using generalised read-across (GenRA). Comput Toxicol 25:100258, PMID: 37693774, 10.1016/j.comtox.2022.100258. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 118.NASEM (National Academies of Sciences, Engineering, and Medicine). 2019. A Class Approach to Hazard Assessment of Organohalogen Flame Retardants. Washington, DC: National Academies Press. 10.17226/25412. [DOI] [PubMed] [Google Scholar]
  • 119.Austel V. 1983. Features and problems of practical drug design. In: Steric Effects in Drug Design, Topics in Current Chemistry, vol. 114. Berlin, Germany: Springer-Verlag, 7–19. [Google Scholar]
  • 120.Burdett JK. 1995. Topological aspects of chemical bonding and structure explored through the method of moments. J Mol Struct Theochem 336(2–3):115–136, 10.1016/0166-1280(94)04073-2. [DOI] [Google Scholar]
  • 121.Wolber G, Langer T. 2005. LigandScout: 3-D pharmacophores derived from protein-bound ligands and their use as virtual screening filters. J Chem Inf Model 45(1):160–169, PMID: 15667141, 10.1021/ci049885e. [DOI] [PubMed] [Google Scholar]
  • 122.MOE (Molecular Operating Environment, Chemical Computing Group). 1988. Laplacian matrices of graphs. In: MATH/CHEM/COMP. Amsterdam, The Netherlands: Elsevier, 1–8. [Google Scholar]
  • 123.Dassault Systèmes. 2023. Discovery Studio. https://www.3ds.com/products/biovia/discovery-studio [accessed 13 June 2024].
  • 124.Berglund A, De Rosa MC, Wold S. 1997. Alignment of flexible molecules at their receptor site using 3D descriptors and hi-PCA. J Comput Aided Mol Des 11:601–612. [DOI] [PubMed] [Google Scholar]
  • 125.Goodsell DS, Morris GM, Olson AJ. 1996. Automated docking of flexible ligands: applications of AutoDock. J Mol Recognit 9(1):1–5, PMID: 8723313. [PubMed] [Google Scholar]
  • 126.Verdonk ML, Cole JC, Hartshorn MJ, Murray CW, Taylor RD. 2003. Improved protein–ligand docking using GOLD. Proteins Struct Funct Bioinforma 52(4):609–623, PMID: 12910460, 10.1002/prot.10465. [DOI] [PubMed] [Google Scholar]
  • 127.Schrödinger, LLC. 2012. Small-Molecule Drug Discovery Suite. https://www.schrodinger.com/ [accessed 26 July 2024].
  • 128.Mansouri K, Grulke CM, Judson RS, Williams AJ. 2018. OPERA models for predicting physicochemical properties and environmental fate endpoints. J Cheminform 10(1):10, PMID: 29520515, 10.1186/s13321-018-0263-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 129.Morgan HL. 1965. The generation of a unique machine description for chemical structures – a technique developed at chemical abstracts service. J Chem Doc 5(2):107–113, 10.1021/c160017a018. [DOI] [Google Scholar]
  • 130.O’Boyle NM, Banck M, James CA, Morley C, Vandermeersch T, Hutchison GR. 2011. Open babel: an open chemical toolbox. J Cheminform 3(10):33, PMID: 21982300, 10.1186/1758-2946-3-33. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 131.Bath PA, Morris CA, Willet P. 1993. Effect of standardization on fragment-based measures of structural similarity. J Chemom 7(6):543–550, 10.1002/cem.1180070607. [DOI] [Google Scholar]
  • 132.Batista J, Bajorath J. 2007. Chemical database mining through entropy-based molecular similarity assessment of randomly generated structural fragment populations. J Chem Inf Model 47(1):59–68, PMID: 17238249, 10.1021/ci600377m. [DOI] [PubMed] [Google Scholar]
  • 133.Informatics Matters. 2024. Fragment Network. http://informaticsmatters.com/pages/fragment_network.html [accessed 13 June 2024].
  • 134.Berthold MR, Cebron N, Dill F, Di Fatta G, Gabriel TR, Georg F, et al. 2008. KNIME: the Konstanz information miner. In: Data Analysis, Machine Learning and Applications: Proceedings of the 31st Annual Conference of the Gesellschaft Für Klassifikation e.V., Albert-Ludwigs-Universität Freiburg. Preisach C, Burkhardt H, Schmidt-Thieme L, Decker R, eds. 7–9 March 2007. Berlin, Germany: Springer, 319–326. [Google Scholar]
