Toxicological Sciences. 2025 Sep 30;208(2):269–278. doi: 10.1093/toxsci/kfaf135

Beyond QSARs: Quantitative Knowledge–Activity Relationships (QKARs) for enhanced drug toxicity prediction

Ting Li 1,#, Yanyan Qu 2,3,#, Alexander Chen 4, Shraddha Thakkar 5, Dongying Li 6, Weida Tong 7
PMCID: PMC12646586  PMID: 41025529

Abstract

Computational toxicology plays an important role in risk assessment and drug safety. The field has been traditionally dominated by Quantitative Structure–Activity Relationships (QSARs), which predict toxicological effects based solely on chemical structure. Although QSARs have achieved successes, their structure reliance limits drug toxicity predictions, where small structural modifications may cause major toxicity changes. Advances in artificial intelligence (AI), especially text embedding and generative AI, provide an opportunity to enhance toxicity predictions by leveraging broader chemical knowledge and its integration with structural data. In this study, we propose a novel framework, Quantitative Knowledge–Activity Relationships (QKARs), which predicts toxicity using domain-specific knowledge. We developed QKAR models for two drug toxicity endpoints, drug-induced liver injury (DILI) and drug-induced cardiotoxicity (DICT), using three different knowledge representations with varying levels of knowledge. The representations based on comprehensive knowledge of the drugs yielded better prediction than those with simpler knowledge. Five machine learning algorithms of distinct complexity were applied in QKAR models, and we observed little association between model complexity and performance. Further, we evaluated QKARs against QSARs on the same endpoints using identical datasets. We found that QKARs consistently outperformed QSARs for DILI and DICT. Notably, QKARs demonstrated better capability than QSARs in differentiating drugs with similar structures but different liver toxicity profiles. We also investigated integrating knowledge-based and structure-based representations, Q(K + S)ARs, for further enhanced prediction accuracy. Our findings demonstrate the potential of QKARs as a robust alternative to QSARs, offering additional opportunities in drug toxicity assessments by leveraging both domain-specific knowledge and structural data.

Keywords: computational toxicology, Quantitative Knowledge–Activity Relationships (QKARs), drug-induced cardiotoxicity, drug-induced liver injury, Quantitative Structure–Activity Relationships (QSARs), New Approach Methodologies (NAMs)


Computational toxicology focuses on using in silico approaches to predict the toxic effects of chemicals on living organisms; it has played an increasing role in toxicological research and drug safety studies (Kavlock et al. 2008; Muster et al. 2008; Li et al. 2021, 2023; Sinha et al. 2023; Haßmann et al. 2024; Li et al. 2024; Qu et al. 2025), particularly with the advancement in artificial intelligence (AI) and machine learning (ML). The use of computational approaches is also encouraged by regulatory agencies to support safety assessment. For example, in April 2025, the US Food and Drug Administration (FDA) released the FDA’s Roadmap to Reducing Animal Testing in Preclinical Safety Studies (https://www.fda.gov/files/newsroom/published/roadmap_to_reducing_animal_testing_in_preclinical_safety_studies.pdf), which emphasizes the adoption of New Approach Methodologies (NAMs), including AI models, in safety evaluation of consumer products. This roadmap is further supported by the FDA Modernization Act 3.0 (https://www.congress.gov/bill/119th-congress/senate-bill/355), indicating FDA’s broader commitment to modernizing toxicological assessments and reducing reliance on animal use.

Since its inception, computational toxicology has been dominated by Quantitative Structure–Activity Relationship (QSAR) approaches, which are based on the principle that a chemical’s structure determines its biological or toxicological activity (Tropsha et al. 2024). Accordingly, QSAR models widely employ ML to predict toxicity from chemical structures represented by chemical descriptors (Popova et al. 2018; Muratov et al. 2020; Li et al. 2021; Pérez Santín et al. 2021; Chen et al. 2023; Moret et al. 2023). Although QSARs have shown considerable success in toxicology applications (Nantasenamat et al. 2010), their exclusive reliance on chemical structures has limited the scope of their application. This shortcoming is specifically amplified in drug toxicity assessment, where minor structural modifications may result in significant changes in toxicity. For example, ibuprofen and ibufenac differ by only a single methyl group, but ibuprofen is generally considered a safe drug (Rainsford 2009), whereas ibufenac was withdrawn from the market due to its severe hepatotoxicity (Goldkind and Laine 2006). Many of these “drug-pair” cases highlight the need for more informative chemical representations that effectively capture features relevant to biological activity.

Advancements in AI and ML have provided essential capabilities to enhance computational toxicology. One of the most significant advancements in AI is the rise of generative AI (GenAI), which has shown remarkable abilities to understand not only text documents but also other types of data (Cao et al. 2023; Fui-Hoon Nah et al. 2023). Prominent examples are Large Language Models (LLMs) such as ChatGPT, which has demonstrated the ability to summarize complex documents (Liu et al. 2024; Ying et al. 2024), pass medical exams (Mbakwe et al. 2023; Wang et al. 2023), and reason over vast and diverse knowledge bases (Bang et al. 2023). In addition, ChatGPT has been applied in various domains rich in contextual information, including those involving drugs and chemicals, with knowledge covering mechanisms of action, metabolic pathways, and off-target interactions (Zheng et al. 2023; Pradhan et al. 2024). Importantly, such knowledge can be converted to numerical vectors, known as text embeddings, which can then serve as input features for various downstream modeling tasks.

We hypothesized that this embedded domain knowledge can be leveraged for enhanced drug toxicity prediction. When combined with traditional structural data used in QSARs, this approach may enable the development of more accurate, mechanistically informed toxicity prediction. Therefore, we proposed Quantitative Knowledge–Activity Relationships (QKARs), which predict drug toxicity using domain-specific knowledge derived from LLMs through text embedding. We developed QKAR models for two drug toxicity endpoints, cardiotoxicity and liver toxicity, using different knowledge representations varying in the scope and depth of knowledge. We then evaluated the predictive performance of QKARs against traditional QSARs using identical datasets and the same development framework. Comparative analyses were also conducted with a range of ML algorithms with varying levels of complexity. Furthermore, we explored the impact of integrating knowledge-based and structure-based representations on prediction performance.

Materials and methods

Datasets

DILIst (Thakkar et al. 2020) and DICTrank (Qu et al. 2023) were the largest datasets with annotation for drug-induced liver injury (DILI) and drug-induced cardiotoxicity (DICT) primarily or completely based on FDA drug labeling documents, respectively. These datasets were used to develop all the models in this study. As biologics, mixtures, organometallics, and inorganics are unsuitable for QSAR modeling, they were removed from both datasets prior to model development. QKAR and QSAR models were developed using identical datasets to ensure a fair comparison in prediction performance.

The datasets were split for model training and testing based on drug approval years to simulate the real-world application. The training set consisted of drugs approved in earlier years, whereas the test set included those approved later, enabling models developed on earlier drugs to prospectively predict the toxicity of subsequently approved drugs. For DILI, the DILIst data were split based on the drug approval year of 1997, as shown in our previous study (Li et al. 2021), yielding 753 drugs (455 positive and 298 negative for DILI) for training and 249 drugs (149 positive and 100 negative for DILI) for testing. Similarly, the DICTrank data were split at the year 2005, in accordance with our previous study on DICT (Qu et al. 2025), which resulted in a training set of 621 drugs (485 positive and 136 negative for DICT) and a test set of 303 drugs (193 positive and 110 negative).
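The approval-year split described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the `Drug` record, the `split_by_approval_year` helper, the toy drug names, years, and labels are all hypothetical, and placing the cutoff year itself in the training set is an assumption, since the text only states that the split was based on that year.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Drug:
    name: str           # drug name (toy placeholders below)
    approval_year: int  # year of approval
    toxic: bool         # binary label, e.g. DILI positive/negative


def split_by_approval_year(drugs, cutoff_year):
    """Return (train, test): drugs approved up to the cutoff vs. after it.

    Assumption: the cutoff year itself is assigned to the training set.
    """
    train = [d for d in drugs if d.approval_year <= cutoff_year]
    test = [d for d in drugs if d.approval_year > cutoff_year]
    return train, test


# Toy records with made-up years and labels, for illustration only.
drugs = [
    Drug("drug_a", 1985, True),
    Drug("drug_b", 1997, False),
    Drug("drug_c", 2003, True),
    Drug("drug_d", 2012, False),
]
train, test = split_by_approval_year(drugs, cutoff_year=1997)
print([d.name for d in train])  # ['drug_a', 'drug_b']
print([d.name for d in test])   # ['drug_c', 'drug_d']
```

Splitting on time rather than at random keeps later-approved drugs entirely out of training, which is what makes the test-set evaluation prospective.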

Generation of knowledge representation for QKARs

The knowledge representation for QKAR model development involved two steps. First, the knowledge summary for a drug was generated using GPT-4o with two different prompts that varied in the specificity of their instructions. The prompts are shown in Table 1. Next, the knowledge summary or the name of a drug was embedded to generate the knowledge representation using the text-embedding-3-large model, a state-of-the-art transformer-based method that converts textual descriptions into high-dimensional vector representations (3072 dimensions). This embedding model was trained with a technique called “Matryoshka Representation Learning” on a diverse and extensive corpus; it captures semantic relationships, contextual details, and syntactic structures, making it effective for tasks such as classification, clustering, and information retrieval (Kusupati et al. 2022). The model not only analyzes individual words but also examines the context in which those words are used, resulting in more accurate and meaningful vector representations. As a result, three knowledge representations (named DrugName, SimpleTox, and PharmTox) were generated and used for subsequent model development.

Table 1.

Knowledge representations generated using three different sets of input.^a

Representation type | Input/prompt
DrugName | [Drug Name] (e.g. acetaminophen)
SimpleTox | “Please summarize key information about DILI^a for [Drug Name] in 100 words.”
PharmTox | “Please summarize DILI^a-related information for [Drug Name] in 200 words. Include: (1) drug name, class, and indication. (2) usage (dose, route, duration). (3) liver^a-related side effects. (4) drug–drug interactions affecting liver^a. (5) ADME, especially liver^a metabolism and toxic metabolites. (6) mechanisms of liver^a injury. (7) biomarkers^a (e.g. ALT, AST). (8) patient risk factors (e.g. age, genetics). (9) warnings, case reports, or regulatory actions (e.g. withdrawal or black box warnings). (10) latency and severity of liver^a damage. Also, indicate the **overall DILI^a concern level** (e.g. low, moderate, high), and how it compares to similar drugs.”

^a Prompts were modified for DICT where endpoint-specific knowledge is relevant.
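The two-step generation of a knowledge representation (summarize, then embed) can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: `knowledge_representation`, `TEMPLATES`, `fake_summarize`, and `fake_embed` are hypothetical names, and the two stand-in functions replace the GPT-4o and text-embedding-3-large calls so that the sketch runs offline without any API access.

```python
from typing import Callable, List, Optional

# Prompt template per representation. None means the drug name is embedded
# directly. The SimpleTox wording follows Table 1; PharmTox would use the
# longer Table 1 prompt in the same way and is omitted here for brevity.
TEMPLATES = {
    "DrugName": None,
    "SimpleTox": "Please summarize key information about DILI for {drug} in 100 words.",
}


def knowledge_representation(
    drug: str,
    representation: str,
    summarize: Callable[[str], str],
    embed: Callable[[str], List[float]],
) -> List[float]:
    """Step 1: build the text to embed; step 2: embed it."""
    template: Optional[str] = TEMPLATES[representation]
    text = drug if template is None else summarize(template.format(drug=drug))
    return embed(text)


# Stand-ins (hypothetical). A real pipeline would call an LLM and an
# embedding model here; the paper reports 3072-dimensional vectors.
def fake_summarize(prompt: str) -> str:
    return "SUMMARY OF: " + prompt


def fake_embed(text: str) -> List[float]:
    return [float(len(text)), float(text.count(" "))]


vec_name = knowledge_representation("acetaminophen", "DrugName", fake_summarize, fake_embed)
vec_simple = knowledge_representation("acetaminophen", "SimpleTox", fake_summarize, fake_embed)
print(vec_name)
print(vec_simple)
```

Passing the summarizer and embedder in as arguments keeps the representation logic separate from any particular model version, which matters because LLM output can vary between calls and between releases.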

Model development and evaluation

Five ML methods with distinct algorithmic complexity (Wu et al. 2021) were applied for model development: K-Nearest Neighbors (KNN), Logistic Regression (LR), Support Vector Machine (SVM), Random Forest (RF), and Extreme Gradient Boosting (XGBoost). Detailed descriptions of these methods are available in the literature (Li et al. 2021).

For each combination of knowledge representation and ML method, the training set was randomly split in a stratified manner (the prevalence of positive/negative was maintained) to allocate 80% of the data for model development and 20% for internal evaluation. This process was repeated over 100 iterations to ensure robust performance estimation. A grid search was employed to identify the optimal hyperparameters for each combination of method and knowledge representation. The evaluation results from the 20% splits across the 100 iterations were summarized to provide a comprehensive assessment of each knowledge representation. Thereafter, the entire training dataset was used to build the final QKAR models with the optimized hyperparameters and to make predictions on the test set for two toxicity endpoints, respectively.
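The repeated stratified 80/20 evaluation can be sketched in plain Python as follows. The helper names (`stratified_split`, `repeated_evaluation`) and the placeholder scorer are hypothetical; in a real pipeline the `fit_and_score` callback would train a model (including the hyperparameter grid search) on the development part and return its MCC on the holdout part.

```python
import random


def stratified_split(labels, holdout_fraction, rng):
    """Split indices so that each class keeps its prevalence in both parts."""
    develop, holdout = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_holdout = round(len(idx) * holdout_fraction)
        holdout.extend(idx[:n_holdout])
        develop.extend(idx[n_holdout:])
    return develop, holdout


def repeated_evaluation(labels, fit_and_score, n_iterations=100,
                        holdout_fraction=0.2, seed=0):
    """Repeat the stratified split and collect one score per iteration."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_iterations):
        develop, holdout = stratified_split(labels, holdout_fraction, rng)
        scores.append(fit_and_score(develop, holdout))
    return scores


# Class counts of the DILI training set: 455 positive, 298 negative.
labels = [1] * 455 + [0] * 298
develop, holdout = stratified_split(labels, 0.2, random.Random(0))
print(len(develop), len(holdout))

# Placeholder scorer (hypothetical): reports the positive prevalence of the
# holdout part instead of training a model.
def prevalence_scorer(develop_idx, holdout_idx):
    return sum(labels[i] for i in holdout_idx) / len(holdout_idx)


scores = repeated_evaluation(labels, prevalence_scorer)
print(len(scores), round(scores[0], 3))
```

Because the split is done per class, every holdout part has the same positive/negative ratio as the full training set, so score differences between iterations reflect which drugs were held out rather than shifts in class balance.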

Given the imbalanced distribution of positives and negatives in both endpoints, the Matthews Correlation Coefficient (MCC) was used as the primary evaluation metric. MCC is a reliable metric for binary classification that offers a balanced assessment of model performance. Its values range from −1 to 1, with a higher score (closer to 1) indicating more accurate predictions for both positive and negative classes. In addition to MCC, we also evaluated models using accuracy, balanced accuracy, sensitivity, specificity, F1-score, and AUC-ROC to provide a more comprehensive performance assessment.
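To illustrate why MCC suits imbalanced data, the sketch below computes it from confusion-matrix counts using the standard definition, MCC = (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The function names are hypothetical, and returning 0 when the denominator is zero is a common convention rather than something stated in the text. The example uses the class counts of the DILI test set (149 positive, 100 negative).

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


# A classifier that labels every DILI test drug as positive.
always_positive = dict(tp=149, tn=0, fp=100, fn=0)
print(round(accuracy(**always_positive), 3))  # 0.598
print(mcc(**always_positive))                 # 0.0

# A perfect classifier on the same test set.
print(mcc(tp=149, tn=100, fp=0, fn=0))        # 1.0
```

The always-positive classifier reaches about 60% accuracy purely from class prevalence, yet its MCC is 0, which is the behavior that makes MCC a safer primary metric here.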

Comparison of QKARs and QSARs

QSAR models were developed for the same two toxicity endpoints (DILI and DICT) following the same framework as QKARs (i.e. identical datasets and ML methods) with chemical descriptors instead of knowledge representations. The chemical descriptors were generated using Mordred (Moriwaki et al. 2018), an open-source tool that calculates over 1,800 two-dimensional and three-dimensional molecular descriptors. These descriptors capture physicochemical, topological, and geometrical properties and are widely used in QSAR studies. Mordred descriptors have demonstrated strong performance in various property prediction tasks (Saini and Ramanathan 2022; Tian et al. 2022). By maintaining the same ML pipeline and evaluation procedures, this setup allows for a direct comparison between two approaches as well as an assessment of the added value of knowledge representations over structure-based descriptors.

Drug pair analysis

Eight pairs of drugs with similar chemical structures but differing DILI risk were selected from the DILI dataset based on a recent publication from our group (Shrimali et al. 2025). These eight pairs of drugs were removed from the training set, while the remaining 986 drugs in the DILI dataset were used to develop QKAR and QSAR models using LR. DILI predictions were then made for each drug of these eight drug pairs. The QSAR model and the QKAR model were compared for their ability to distinguish the two drugs within a pair by examining the relative risk of the more toxic drug over the less toxic one in causing DILI.
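The relative risk used in the drug pair analysis is the ratio of the two predicted DILI probabilities within a pair. A minimal sketch is given below; the `relative_risk` helper, the pair names, and the probabilities are hypothetical and are not results from this study.

```python
def relative_risk(p_more_toxic, p_less_toxic):
    """Predicted DILI probability of the more toxic drug over the less toxic one."""
    if p_less_toxic <= 0:
        raise ValueError("probability of the less toxic drug must be positive")
    return p_more_toxic / p_less_toxic


# Made-up predicted probabilities: (more toxic drug, less toxic drug).
pairs = {
    "pair_1": (0.80, 0.40),
    "pair_2": (0.45, 0.60),
}
for name, (p_more, p_less) in pairs.items():
    rr = relative_risk(p_more, p_less)
    verdict = "distinguished" if rr > 1 else "not distinguished"
    print(name, round(rr, 2), verdict)
```

A ratio above 1 means the model assigns the higher probability to the drug with the higher known DILI risk; a ratio at or below 1 means it fails to separate the pair in the right direction.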

Integration of knowledge- and structure-based representations

The combined feature vectors were created by concatenating knowledge representation with molecular descriptors, allowing the models to capture both semantic insights and physicochemical properties of compounds. Model development and evaluation followed the same pipeline as the QKAR and QSAR models.
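Concatenation of the two feature types can be sketched as below. The `combine_features` helper and the toy values are hypothetical; the vectors are shortened for readability, whereas the embeddings in this study have 3072 dimensions and Mordred yields over 1,800 descriptors per molecule.

```python
def combine_features(knowledge_vec, descriptor_vec):
    """Concatenate one drug's knowledge embedding with its descriptor vector."""
    return list(knowledge_vec) + list(descriptor_vec)


# Toy values for a single drug, for illustration only.
knowledge_vec = [0.12, -0.03, 0.27]  # hypothetical embedding values
descriptor_vec = [180.16, 1.19]      # hypothetical descriptor values
combined = combine_features(knowledge_vec, descriptor_vec)
print(len(combined), combined)
```

The two blocks can differ greatly in numeric scale, so a practical pipeline may need to standardize features before modeling; the text does not state whether this was done.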

Results

Development of QKAR models

We developed QKAR models for two human-relevant drug toxicity endpoints, DILI (Thakkar et al. 2020) and DICT (Qu et al. 2023), as they are the top two causes for drug failure (Onakpoya et al. 2016) and challenging safety concerns to address with existing approaches. We used the DILIst (Thakkar et al. 2020) and DICTrank (Qu et al. 2023) datasets and split the data based on the year of drug approval for model training and testing to simulate the real-world application. Thus, the models were trained on drugs approved in earlier years and tested to prospectively predict the toxicity of those approved later.

The essential component of QKAR modeling is the generation of knowledge representations. As shown in Fig. 1, knowledge representations were generated by first summarizing drug-related toxicity information using GPT-4o. The prompts for DILI are provided in Table 1, whereas those for DICT are listed in Table S1. The knowledge summaries, or the drug names, were then embedded to generate knowledge representations using the text-embedding-3-large model. The input for knowledge embedding varied in length and specificity. Specifically, the first representation was based solely on drug names. The “SimpleTox” representation used a simple prompt to inquire toxicity information about a drug of interest. Lastly, the “PharmTox” representation was based on a comprehensive summary of pharmacological and toxicological data about a drug. We then applied each of the three knowledge representations to five ML methods with different algorithmic complexity (i.e. KNN, LR, RF, SVM, and XGBoost) to develop a total of 15 QKAR models.

Fig. 1.

Workflow for QKAR development. QKAR models were developed for both DILI (Drug-Induced Liver Injury) and DICT (Drug-Induced Cardiotoxicity). The datasets were split for training and testing based on drug approval years to simulate real-world applications. For each drug, a knowledge summary was generated using GPT-4o with different prompts. The knowledge summary was then embedded using the text-embedding-3-large model to generate the knowledge representation. Five ML algorithms with varying levels of complexity were applied to develop QKAR models, which were then used to predict DILI and DICT on the respective test sets.

QKAR models performed well with knowledge representations embedding comprehensive toxicity information

We evaluated the performance of the QKAR models in predicting DILI and DICT using a set of metrics, including MCC, accuracy, specificity, sensitivity, AUC, and F1 score (Table S2). Given the imbalanced distribution of positives and negatives in both endpoints, MCC was used as the primary evaluation metric. Performance distributions based on MCC across different knowledge representations and ML methods were visualized using violin plots for cross-validation and bar plots for testing (Fig. 2).

Fig. 2.

Performance of the QKAR models across 3 knowledge representations and 5 ML methods. For both (A) DILI (Drug-Induced Liver Injury) and (B) DICT (Drug-Induced Cardiotoxicity), the violin plots show the distributions of cross-validation results on the training set, whereas the bar charts display the prediction results on the test sets. CV, cross-validation results; Test, test results.

For DILI prediction, the SimpleTox and PharmTox knowledge representations consistently yielded better prediction results than the DrugName representation across all five algorithms in both cross-validation and testing, with the only exception of KNN in test results (Fig. 2A). Notably, the PharmTox representation demonstrated more consistent performance than the other two, exhibiting low sensitivity (i.e. little variation) to different ML methods.

For DICT prediction, we observed similar patterns in cross-validation, with SimpleTox and PharmTox performing better than DrugName in QKAR models (Fig. 2B, top). In the test results, the QKAR models using the PharmTox representation overall showed the best prediction results with the exception of the KNN-based model. LR models showed the most prominent improvement with the SimpleTox and PharmTox knowledge representations compared with the DrugName representation (Fig. 2B, bottom).

These results suggested that the comprehensive toxicity information enabled higher predictive performance in QKARs than the simpler input (i.e. drug name only), highlighting the importance of domain-specific knowledge for accurate drug toxicity evaluation. Furthermore, the more complex ML methods did not show better prediction accuracy for either DILI or DICT with the test sets. On the contrary, LR, as one of the simple models, achieved much better prediction for DICT than the other four models. The results showed little association between the complexity of the ML methods and the prediction performance, indicating that the selection of algorithms may not be critical.

QKARs outperformed QSARs in both DILI and DICT prediction

To evaluate the performance of QKARs relative to the conventional QSARs, we developed QSAR models using the identical datasets and model development pipeline as for QKAR development; the only difference was to use structure-based descriptors for QSARs instead of knowledge representations for QKARs. For both DILI and DICT, QKAR models consistently outperformed QSAR models in cross-validation regardless of the ML methods that were used, as shown in the violin plots (Fig. 3, top). LR proved to be a simple yet effective method with strong performance; therefore, we further examined the individual models generated during cross-validation with LR. As we performed 5-fold cross-validation with 20 iterations, a total of 100 distinct train–test splits were generated. For each split, two models were developed with LR: One using the QKAR approach and the other using QSAR. Comparative analysis showed that QKAR outperformed QSAR in all 100 DILI models and in 90 out of 100 DICT models (Fig. S1). Furthermore, we compared the model performance on the test set and observed that QKARs again showed superior performance to QSARs (Fig. 3, bottom). Our results suggested that QKAR may serve as a robust alternative to QSAR for enhanced drug toxicity prediction.

Fig. 3.

Performance comparison between QKAR and QSAR models. For both (A) DILI (Drug-Induced Liver Injury) and (B) DICT (Drug-Induced Cardiotoxicity), the violin plots show the distributions of cross-validation results on the training sets, and the bar charts display the prediction results on the test sets. The results from QKARs were based on the PharmTox representations.

QKARs exhibited higher capability than QSARs in differentiating structurally similar drugs with large differences in DILI risks

Chemical structure-based approaches, such as QSARs, tend to have difficulties separating structurally similar compounds with opposite toxicities (Maggiora 2006). To assess the predictive resolution of QKARs on these cases, we compared the performance of the QKAR and QSAR models using the LR method for eight drug pairs; each pair consisted of two structurally similar drugs with large differences in hepatotoxicity (Fig. 4). We calculated the relative risk of the two drugs in a pair in causing DILI by dividing the model-predicted DILI probability of the more toxic drug by that of the less toxic one. A relative risk greater than 1 indicates the ability of a model to distinguish between two drugs with different potential to cause DILI. The greater the relative risk was, the more capable a model was of differentiating the drug pair. Based on this preliminary analysis, we observed that the QKAR model achieved a relative risk greater than 1 for all eight drug pairs, whereas the QSAR model surpassed this threshold for only five pairs. Our results suggested that the QKAR model exhibited better capability than the QSAR model in differentiating structurally similar drugs with varying toxicity levels.

Fig. 4.

Comparison of QKARs and QSARs in differentiating drugs with similar structure but different potential to cause DILI. Each bar represents the relative risk of two drugs in causing DILI based on a model’s prediction. A relative risk greater than 1 (to the right of the red dashed line) indicates the ability of a model to differentiate the more toxic drug from the less toxic one. The greater the relative risk, the better a model distinguishes a drug pair. Both QKAR and QSAR models were developed using the logistic regression (LR) method. The PharmTox representation was used to develop the QKAR model.

Integrating knowledge- and structure-based representations for enhanced toxicity prediction

We also investigated whether integrating the knowledge and structure of drugs would further improve toxicity prediction. This approach, referred to as Q(K + S)ARs (Quantitative Knowledge & Structure–Activity Relationships), was designed to leverage the complementary strengths of biomedical context and molecular descriptors. To that end, we developed the Q(K + S)AR models by combining the knowledge-based and structure-based features and following the same modeling pipeline as depicted in Fig. 1. These models were evaluated across five ML methods and compared against both QSARs and QKARs for both the DILI and DICT endpoints. For both DILI (Fig. 5A) and DICT (Fig. 5B), the Q(K + S)AR models consistently outperformed QSAR models; however, their advantage over QKAR models was modest.

Fig. 5.

Performance comparison among Q(K + S)AR, QKAR, and QSAR models. For both (A) DILI (Drug-Induced Liver Injury) and (B) DICT (Drug-Induced Cardiotoxicity), the violin plots show the distributions of cross-validation results on the training set, whereas the bar charts display the prediction results on the test set. The results from Q(K + S)ARs and QKARs were based on the PharmTox representations.

Discussion

In this study, we developed a novel framework, QKARs, that leverages contextual chemical knowledge from LLMs via text embeddings for toxicity prediction, focusing on DILI and DICT—two of the most challenging endpoints in drug development and safety assessment. We conducted a head-to-head comparison between QKARs and QSARs using the same datasets and workflow, employing five ML algorithms with varying degrees of algorithmic complexity. QKAR models consistently outperformed QSAR models in both training and test sets, which demonstrated the potential utility of QKARs in predictive toxicology. We also investigated integrating knowledge-based and structure-based representations, Q(K + S)ARs, for further enhanced prediction accuracy.

QSARs have played a pivotal role in predictive toxicology by modeling the relationship between chemical structure and biological activity (Hansch et al. 1962; Tropsha et al. 2024). Traditional and AI-enhanced structure representations range from fingerprints (e.g. MACCS [Durant et al. 2002], ECFP [Rogers and Hahn 2010]), descriptors (e.g. Mold2 [Hong et al. 2008], and Dragon [Mauri et al. 2006]) to graph- and SMILES-based embeddings (Goh et al. 2017; Hirohara et al. 2018; Chakravarti and Alla 2019; Chen et al. 2021; Hung and Gini 2021), which have broadened the QSAR landscape to support diverse applications. Recent approaches, including MoLFormer (Ross et al. 2022) and ChemBERTa (Chithrananda et al. 2020), leverage large chemical structure datasets and transformer architectures to learn molecular features through SMILES representations (Duvenaud et al. 2015; Winter et al. 2019; Chithrananda et al. 2020; Fabian et al. 2020; Irwin et al. 2022; Ross et al. 2022; Xia et al. 2022; Yüksel et al. 2023; Zhou et al. 2023). However, the fundamental principle of QSARs—that chemical structure solely determines biological activity—has inherent limitations, as biological activity is influenced by multiple factors beyond structure, particularly in the context of drug toxicity prediction. The advent of LLMs offers a knowledge-driven alternative to traditional QSAR molecular representations by extracting chemical and biological context from vast scientific literature (Biswas 2023; Guo et al. 2023; Zheng et al. 2023; Pradhan et al. 2024). By generating text-based embeddings that capture these scientific data, knowledge-derived descriptors analogous to traditional QSAR descriptors can be used to complement or even enhance toxicity prediction beyond the limitations of structure-based models.

The broad scope of potential applications of QKARs over QSARs is summarized in Table 2. QSARs are mainly limited to small organic molecules and thus struggle with complex mixtures, product formulations, and distinguishing structurally similar compounds with large differences in biological activities (Maggiora 2006; Weaver and Gleeson 2008; Fourches et al. 2010; Cherkasov et al. 2014). They rely heavily on structure-based features and require chemical preprocessing and feature selection, which may lose information. In contrast, QKARs can model a wide variety of chemical products, including biologics, macromolecules, nanomaterials, inorganics, and mixtures, because they use knowledge representation rather than chemical structure. QKARs can also capture formulation differences, avoid reliance on structural similarity, potentially integrate multi-domain data (chemical, biological, environmental, human, and animal), and eliminate the need for feature selection and chemical preprocessing.

Table 2.

Potential diverse applications of QKARs beyond traditional QSARs.

Scope | QSARs | QKARs
Chemical products | Primarily for small organic molecules, posing a challenge in dealing with other types of chemical products. | Can model a broad range of products, such as biologics (e.g. monoclonal antibodies), macromolecules, nanomaterials, organometallics, inorganics, herbal medicines, dietary supplements, etc.
Complex mixtures | Limited capability to represent and model mixtures, multi-component formulations, or other structurally complex constituents. | Because a complex mixture is represented with knowledge, it can be readily modeled.
Product formulation | Drug activities are often associated with their ionic forms and salt–counterion interactions, which are not represented appropriately with structure-based chemical descriptors, let alone the information relating to dose and administration route. | Knowledge representation can capture the differences in product formulation.
Activity cliffs | Difficult to distinguish structurally similar compounds with large differences in biological activities, which is especially pronounced in drug toxicity. | Structural similarity plays little role in knowledge that may differentiate the biological differences.
Input features | Restricted to structure-based chemical information. | Focused on knowledge, allowing integration of multi-domain knowledge (chemical, biological, environmental, animal, and human data).
Modeling complexity | Chemical preprocessing and feature selection are often required, which may lead to information loss. | Neither chemical preprocessing nor feature selection is needed.

Despite the promising potential of QKARs, several challenges must be addressed to fully realize their capabilities. First, QKARs rely on domain-specific knowledge; therefore, further investigation is required for applying QKARs to data-poor chemicals, such as drug candidates in the early stages of drug development. Second, variability introduced by different prompts may lead to different outcomes in knowledge summarization and, consequently, knowledge-based model performance. It is worth noting that even the same prompt may result in different responses, presenting a concern for reproducibility in results and consistency in explainability. Lastly, lifecycle maintenance of QKAR models for regulatory use must address challenges associated with the evolving nature of LLMs and text embedding methods, all of which may impact prediction consistency and reproducibility. This suggests that strategies for regular model updates, standardization, and validation must be established. Future work should focus on developing toxicity-specific embeddings and fine-tuning LLMs with curated datasets, such as FDA labeling documents. These efforts will help establish QKAR as a versatile, scalable tool for advancing predictive toxicology and supporting safer drug development.

In theory, combining chemical structure information with relevant domain knowledge should enhance predictive performance. To explore this, we attempted an integration approach, Q(K + S)ARs, by concatenating knowledge-based descriptors with structure-based chemical descriptors as model inputs. However, we did not observe a significant improvement over QKARs alone. We speculated that this was due to the fundamentally different nature of the two descriptor types: Individual chemical descriptors represent specific physicochemical properties independently, whereas QKAR descriptors collectively represent integrated domain knowledge. Simply concatenating the two may dilute or compromise key information about the chemicals that is utilized for modeling. There are multiple strategies that can be explored for integrating chemical information and knowledge, at the descriptor level, the model level, or through hybrid approaches, and this remains an area requiring further investigation.

This study highlights the potential of QKARs to advance computational toxicology by leveraging chemical and biological knowledge for reliable safety assessments. The integration of GenAI and text embedding technologies opens new opportunities for further innovation in predictive toxicology. Future efforts should focus on developing domain-specific LLMs fine-tuned on toxicology and pharmacology datasets to enrich and enhance knowledge representations. Combining these approaches with multimodal data sources, such as high-throughput screening results, omics profiles, and clinical reports, may greatly improve model generalizability and predictive accuracy. Moreover, model interpretation techniques will be critical to ensuring the explainability and reliability needed for regulatory consideration of AI-based toxicity predictions. Ultimately, these innovations have the potential to expand the predictive toxicology toolbox with both structure-centered methodologies and knowledge-driven approaches, thereby accelerating safer drug development and personalized risk assessment.

Supplementary Material

kfaf135_Supplementary_Data

Contributor Information

Ting Li, National Center for Toxicological Research, US Food and Drug Administration, Jefferson, AR 72079, United States.

Yanyan Qu, National Center for Toxicological Research, US Food and Drug Administration, Jefferson, AR 72079, United States; University of Arkansas at Little Rock and University of Arkansas for Medical Sciences Joint Bioinformatics Program, Little Rock, AR 72204, United States.

Alexander Chen, National Center for Toxicological Research, US Food and Drug Administration, Jefferson, AR 72079, United States.

Shraddha Thakkar, Office of Translational Sciences, Center for Drug Evaluation and Research, US Food and Drug Administration, Silver Spring, MD 20903, United States.

Dongying Li, National Center for Toxicological Research, US Food and Drug Administration, Jefferson, AR 72079, United States.

Weida Tong, National Center for Toxicological Research, US Food and Drug Administration, Jefferson, AR 72079, United States.

Supplementary material

Supplementary material is available at Toxicological Sciences online.

Funding

Funding for this research was provided by the US Food and Drug Administration.

Conflicts of interest. None declared.

Data availability

The datasets used to develop and evaluate QKARs were curated from DILIst (https://www.fda.gov/science-research/liver-toxicity-knowledge-base-ltkb/drug-induced-liver-injury-severity-and-toxicity-dilist-dataset) and DICTrank (https://www.fda.gov/science-research/bioinformatics-tools/drug-induced-cardiotoxicity-rank-dictrank-dataset). The knowledge-based representations and structure-based representations have been deposited at https://github.com/TingLi2016/QKAR.

Code availability

QKAR models were developed using open-source Python (version 3.7.3). The source code is available at https://github.com/TingLi2016/QKAR.

Disclaimer

This article reflects the views of the authors and does not necessarily reflect those of the US Food and Drug Administration. Any mention of commercial products is for clarification only and is not intended as approval, endorsement, or recommendation.



Articles from Toxicological Sciences are provided here courtesy of Oxford University Press
