Sci Rep. 2025 Jun 3;15:19491. doi: 10.1038/s41598-025-95720-5

Multimodal deep learning for chemical toxicity prediction and management

Jiwon Hong 1, Hyun Kwon 2
PMCID: PMC12134256  PMID: 40461585

Abstract

The accurate prediction of chemical toxicity is a crucial research focus in chemistry, biotechnology, and national defense. The development of comprehensive datasets for chemical toxicity prediction remains limited due to security constraints and the structural complexity of chemical data. Existing studies are often confined to specific domains, such as genotoxicity or acute oral toxicity. To address these gaps, this study introduces an integrated research dataset that combines chemical property data and molecular structure images. The dataset is curated from diverse sources, preprocessed, and normalized to optimize it for deep learning applications. The proposed deep learning model enhances the precision of multi-toxicity predictions by integrating a Vision Transformer (ViT) architecture for image-based data and a Multilayer Perceptron (MLP) for numerical data. A joint fusion mechanism is employed to effectively combine image and numerical features, significantly improving predictive performance. The model is also designed for multi-label toxicity prediction, enabling simultaneous evaluation of diverse toxicological endpoints. Experimental results show that the ViT model demonstrates an accuracy of 0.872, an F1-score of 0.86, and a Pearson Correlation Coefficient (PCC) of 0.9192.

Keywords: Chemical substances, Toxicity, Chemical safety management, Biochemistry, Vision Transformer (ViT), Wearable sensors

Subject terms: Biochemistry, Chemistry, Chemical engineering

Introduction

Accurate prediction of chemical toxicity1–5 has emerged as a pivotal research area in chemistry, biotechnology, and national defense. As global conflicts and the threats of asymmetric warfare become increasingly prominent, the potential deployment of chemical weapons6 and the occurrence of chemical terrorism pose significant challenges to public safety and security. This highlights the urgent need for sophisticated systems capable of predicting, detecting, and mitigating chemical toxicity to protect both military personnel and civilians.

In addition to these pressing security concerns, chemical toxicity prediction7 is essential for safeguarding the environment and public health. The widespread use of industrial chemicals8, pesticides9, and pharmaceuticals necessitates precise toxicological assessments to ensure regulatory compliance and minimize harm. However, the inherent complexity of chemical substances and the scarcity of comprehensive datasets have hindered progress in this field. Current prediction models often rely on narrow datasets focused on specific toxic endpoints, such as genotoxicity or acute oral toxicity, which limits their generalizability and practical application.

While traditional machine learning techniques1,3,10,11 have been employed in toxicity prediction, they frequently fall short due to their reliance on manually engineered features and their inability to effectively model the non-linear relationships inherent in chemical data. Deep learning models, on the other hand, offer a transformative potential by leveraging advanced architectures to extract and integrate complex patterns from diverse data sources. Despite these advancements, existing deep learning approaches are often restricted to single-modality inputs, such as either numerical data or molecular structure images, failing to capitalize on the synergistic benefits of multi-modal data fusion.

To address these limitations, we introduce an innovative framework for chemical toxicity prediction that integrates chemical property data with molecular structure images into a unified multi-modal deep learning model. Our approach leverages a Vision Transformer (ViT) for processing image-based features and a Multilayer Perceptron (MLP) for handling numerical data, enabling a joint fusion mechanism that significantly enhances predictive accuracy. Unlike existing methods that classify using a single type of dataset, the proposed method introduces a multi-label classification model that integrates multiple dataset types to determine whether a chemical is toxic or non-toxic.

Compared to existing studies, our approach provides several contributions: (1) the development of a comprehensive dataset that integrates chemical property data with molecular structure images, (2) the implementation of a multi-modal deep learning model that achieves superior accuracy through effective data fusion, and (3) the incorporation of wearable sensors to extend the framework’s applicability to dynamic and high-risk environments. These innovations collectively address the critical gaps in existing research, enabling more precise toxicity predictions and facilitating the development of robust chemical safety management systems.

The remainder of this paper is organized as follows: Section “Related work” reviews the related work in the field. Section “Proposed method” presents the proposed methodology in detail. Section “Experimental setup and results” outlines the experimental setup and provides an evaluation of the results. Section “Discussion” offers an in-depth discussion of the proposed approach. Finally, Section “Conclusion” summarizes the findings and concludes the paper.

Related work

The prediction of chemical toxicity has been extensively studied, owing to its critical implications in environmental safety, drug development, and defense against chemical threats. Traditional methods primarily rely on in vivo and in vitro testing, which, despite being highly reliable, are time-consuming, expensive, and ethically challenging. Consequently, computational approaches, particularly those leveraging machine learning and deep learning, have gained significant attention as efficient alternatives.

Traditional machine learning approaches

Early computational efforts in toxicity prediction focused on Quantitative Structure-Activity Relationship (QSAR)12 models, which correlate chemical structures with biological activities or toxic effects. Algorithms such as Support Vector Machines (SVM)13, Random Forests (RF)14, and k-Nearest Neighbors (k-NN)15 have been widely utilized. For instance, Matthews et al. (2016) developed a QSAR model using Random Forests to predict acute toxicity endpoints. Similarly, Trinh et al.16 demonstrated the effectiveness of machine learning in predicting specific endpoints like carcinogenicity and genotoxicity. While these methods achieved moderate success, their reliance on manually engineered features limited their capacity to model complex, non-linear relationships in chemical data.

Deep learning for toxicity prediction

Deep learning models have emerged as powerful tools for addressing the limitations of traditional machine learning. These models, capable of learning hierarchical feature representations, have shown promise in processing diverse data types, such as numerical descriptors and molecular graphs. For instance, Sharma et al.17 proposed a multi-task deep neural network for predicting various toxicity endpoints, demonstrating improved accuracy over traditional QSAR models. Similarly, Sun et al.18 utilized graph convolutional networks (GCNs) to capture structural information from molecular graphs, achieving state-of-the-art performance in predicting mutagenicity and other toxic endpoints. However, most existing deep learning models are designed for single-modality inputs. For example, the model of Schwartz et al.19 processes molecular fingerprints or SMILES strings, while that of Hirohara et al.20 employs convolutional neural networks (CNNs) to analyze chemical images. These approaches often fail to exploit the synergistic benefits of integrating multiple data modalities, such as combining chemical property descriptors with structural images.

Multi-modal toxicity prediction models

Multi-modal learning has recently gained traction as a means of integrating heterogeneous data sources for more robust toxicity prediction. For example, Schneider et al.21 introduced a hybrid model combining molecular descriptors and molecular dynamics simulation data, showing enhanced predictive accuracy for endocrine disruption. Similarly, Liu et al.22 proposed a multi-modal framework that fuses text-based chemical descriptions with image data for broader applicability across diverse datasets. These studies highlight the potential of multi-modal approaches but also reveal challenges, including increased computational complexity and the need for sophisticated data fusion mechanisms.

Building upon these prior works, our study addresses these gaps by introducing a novel multi-modal deep learning framework that combines chemical property data with molecular structure images for binary-label toxicity prediction. Additionally, by integrating wearable sensor technology, our approach bridges the gap between predictive modeling and real-world applicability, offering a scalable solution for dynamic and high-risk environments.

Proposed method

In this study, we propose a multi-modal deep learning model aimed at predicting the toxicity of chemical compounds. As shown in Fig. 1, the model leverages both image-based and tabular data inputs to improve prediction accuracy. Specifically, we adopt the Joint or Intermediate Fusion strategy, which combines information from different modalities at an intermediate stage of the model. This approach allows the model to learn the interactions between different data types while preserving the unique characteristics of each modality.

Fig. 1. Overview of the proposed method. A fully connected layer is also referred to as a Multi-Layer Perceptron (MLP).

Model architecture

The proposed architecture consists of two primary components: image processing and tabular data processing. These two components are fused at an intermediate stage, leading to a final prediction of toxicity. The architecture can be outlined as follows:

Image processing backbone: vision transformer (ViT)

The first input modality consists of 2D structural images of chemical compounds, such as molecular structures. These images are processed by a pre-trained Vision Transformer (ViT) model, which has been fine-tuned to handle chemical structure images. The ViT model employed in this study follows the ViT-Base/16 architecture introduced by Dosovitskiy et al.23, which was pre-trained on the ImageNet-21k dataset and processes input images as 16 × 16-pixel patches at a resolution of 224 × 224 pixels. To adapt this model for chemical structure recognition, we fine-tuned it using a custom dataset of 4179 molecular structure images. These images were collected programmatically using a Python-based web crawler, which systematically extracted publicly available molecular structure images from chemical databases such as PubChem and eChemPortal based on CAS (Chemical Abstracts Service) numbers. Each image was annotated with its corresponding CAS number to ensure alignment with chemical property data. The chemical diversity of the dataset was carefully curated to include a broad spectrum of organic and inorganic compounds. Specifically, the selected CAS numbers encompassed pharmaceuticals, agrochemicals, and industrial chemicals, with deliberate inclusion of compounds featuring diverse functional groups, stereochemical configurations, and molecular sizes. This approach ensures that the model is exposed to a representative subset of chemical space, enhancing its generalizability. The ViT model extracts features from these images and converts them into a 128-dimensional feature vector. Let $x_{\mathrm{img}} \in \mathbb{R}^{H \times W \times C}$ represent the input image of height $H$, width $W$, and $C$ channels. The Vision Transformer processes the image and generates a feature vector $f_{\mathrm{img}} \in \mathbb{R}^{128}$ as follows:

$$f_{\mathrm{img}} = \mathrm{ViT}(x_{\mathrm{img}}), \qquad f_{\mathrm{img}} \in \mathbb{R}^{128}$$

The number of trainable parameters in the MLP layer used for dimensionality reduction is:

$$P_{\mathrm{image\_fc}} = (d_{\mathrm{ViT}} \times 128) + 128 \qquad (1)$$

where $d_{\mathrm{ViT}} = 768$ is the output dimension of the ViT-Base/16 backbone.
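For concreteness, the following is a minimal PyTorch sketch of this image branch. It is an illustrative reconstruction, not the authors' released code: the use of the timm library (and its vit_base_patch16_224.augreg_in21k checkpoint) for the ImageNet-21k pre-trained ViT-Base/16 is an assumption, while the 128-dimensional image_fc projection follows Eq. (1).

```python
import timm
import torch
import torch.nn as nn

class ImageBranch(nn.Module):
    """ViT-Base/16 backbone followed by the image_fc projection to 128-d."""

    def __init__(self, out_dim: int = 128):
        super().__init__()
        # Pre-trained ViT-Base/16 (ImageNet-21k weights, assumed checkpoint tag);
        # num_classes=0 makes timm return the pooled 768-d feature, not logits.
        self.vit = timm.create_model(
            "vit_base_patch16_224.augreg_in21k", pretrained=True, num_classes=0)
        # Eq. (1): (768 x 128) + 128 trainable parameters.
        self.image_fc = nn.Linear(self.vit.num_features, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, 3, 224, 224) structure images -> (B, 128) feature vectors
        return self.image_fc(self.vit(x))
```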

Tabular data processing: multi-layer perceptron (MLP)

The second input modality consists of tabular data representing the chemical properties of the compounds. This data includes both numerical and categorical features, as shown in Table 8. The tabular data is processed by a multi-layer perceptron (MLP) that transforms it into a 128-dimensional feature vector. Let $x_{\mathrm{tab}} \in \mathbb{R}^{d}$ represent the tabular data, where $d$ denotes the number of features in the dataset. The MLP processes the tabular data and generates a feature vector $f_{\mathrm{tab}} \in \mathbb{R}^{128}$, which can be expressed as:

$$f_{\mathrm{tab}} = \mathrm{MLP}(x_{\mathrm{tab}}), \qquad f_{\mathrm{tab}} \in \mathbb{R}^{128}$$

The number of trainable parameters in this MLP layer is given by:

$$P_{\mathrm{feature\_fc}} = (d \times 128) + 128 \qquad (2)$$
Table 8.

Summary of toxicity endpoints across symptom categories, including the number of unique symptom labels and the top 10 most frequently occurring symptoms within each category.

| Symptom category | Number of unique symptom labels | Top 10 most frequently occurring symptoms within each category |
| --- | --- | --- |
| General symptoms | 50 | Irritation (844), Burns (381), Dizziness (212), Headache (195), Vomiting (167), Death (155), Dermatitis (110), Edema (88), Convulsions (81), Coma (58) |
| Inhalation | 60 | Irritation (3466), Headache (1620), Dizziness (1396), Respiratory Dysfunction (1306), Edema (1289), Death (922), Dyspnea (870), Coma (613), Convulsions (609), Vomiting (558) |
| Dermal | 44 | Irritation (3526), Dermatitis (2506), Burns (1033), Cyanosis (518), Blisters (458), Edema (330), Sweating (320), Death (231), Rash (139), Convulsions (73) |
| Ocular | 34 | Irritation (3337), Congestion (1452), Dermatitis (1224), Burns (745), Corneal Damage (386), Edema (63), Death (47), Tearing (40), Convulsions (31), Hemorrhage (17) |
| Oral | 62 | Vomiting (2700), Irritation (1705), Diarrhea (1547), Death (1156), Pneumonitis (726), Burns (674), Headache (643), Convulsions (630), Mental Confusion (461), Coma (461) |
| Other | 23 | Convulsions (7), Headache (7), Coma (6), Anxiety (6), Irritation (6), Lightheadedness (5), Dermatitis (4), Dizziness (3), Lethargy (3), Confusion (2) |
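Returning to the tabular branch, the following is a minimal PyTorch sketch. Consistent with Eq. (2), a single linear layer (named feature_fc, matching the layer name used later in the text) maps the $d$ input features to 128 dimensions; the ReLU nonlinearity and the dropout rate of 0.3 are assumptions based on the dropout setting reported in the experimental setup.

```python
import torch
import torch.nn as nn

class TabularBranch(nn.Module):
    """feature_fc: maps d normalized numeric + one-hot features to 128-d."""

    def __init__(self, num_features: int, out_dim: int = 128):
        super().__init__()
        self.feature_fc = nn.Sequential(
            nn.Linear(num_features, out_dim),  # Eq. (2): (d x 128) + 128 params
            nn.ReLU(),
            nn.Dropout(0.3),  # dropout rate from the experimental setup
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, d) tabular property vectors -> (B, 128)
        return self.feature_fc(x)
```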

Fusion layer: joint fusion of features

The features extracted from both modalities are then fused at an intermediate stage. The image feature vector $f_{\mathrm{img}}$ and the tabular data feature vector $f_{\mathrm{tab}}$ are concatenated to form a fused feature vector $f_{\mathrm{fused}}$:

$$f_{\mathrm{fused}} = [\, f_{\mathrm{img}} \,;\, f_{\mathrm{tab}} \,] \in \mathbb{R}^{256}$$

This fused feature vector $f_{\mathrm{fused}}$ is then passed to the toxicity prediction module for further processing.

Toxicity prediction module: fully connected layer

The fused feature vector $f_{\mathrm{fused}}$ is passed through a fully connected layer (MLP) to generate the final toxicity prediction. The output of this module consists of independent probability values $\hat{p}_i$ for each toxicity label. Let $\hat{p}_i$ represent the predicted probability of toxicity for label $i$, which can be calculated as:

$$\hat{p}_i = \sigma\!\left( w_i^{\top} f_{\mathrm{fused}} + b_i \right)$$

where $\sigma$ denotes the sigmoid activation function, $w_i$ is the weight vector associated with label $i$, and $b_i$ is the bias term. The sigmoid function ensures that the output is a probability value in the range [0, 1]. The total number of trainable parameters in this layer is:

$$P_{\mathrm{output\_fc}} = (256 \times C) + C \qquad (3)$$

where C represents the number of toxicity labels.

During training, all three MLP layers (image_fc, feature_fc, output_fc) are optimized jointly, ensuring that the ViT-based image representation and the tabular MLP representation are updated simultaneously rather than freezing the early layers. Optimization uses the Adam optimizer, and because the output consists of multiple independent probability scores, each label is trained as a separate binary prediction.

Thresholding for prediction

Once the probabilities for each toxicity label have been computed, a threshold of 0.5 is applied to determine whether a compound is toxic for each label. If $\hat{p}_i \geq 0.5$, the compound is predicted to be toxic for label $i$ (i.e., the label is classified as positive). If $\hat{p}_i < 0.5$, the label is classified as negative. This decision rule can be expressed as:

$$\hat{y}_i = \begin{cases} 1, & \hat{p}_i \geq 0.5 \\ 0, & \hat{p}_i < 0.5 \end{cases}$$
Algorithm 1. Multi-modal deep learning for toxicity prediction.

Training objective

The model is trained using a multi-label binary cross-entropy loss for each toxicity label. The loss for label $i$ is defined as:

$$\mathcal{L}_i = -\left[\, y_i \log \hat{p}_i + (1 - y_i) \log\left(1 - \hat{p}_i\right) \right]$$

where $y_i \in \{0, 1\}$ is the ground-truth value for label $i$.

The total loss for the model is the sum of the individual losses for all labels:

$$\mathcal{L} = \sum_{i=1}^{C} \mathcal{L}_i$$

where $C$ is the number of toxicity labels. The model is optimized to minimize the total loss function, thereby improving the accuracy of the toxicity predictions. Details of the procedure for the proposed scheme are provided in Algorithm 1.
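Putting the pieces together, the following sketch composes the ImageBranch and TabularBranch classes from the earlier sketches into the joint-fusion model and runs one training step. BCELoss, the Adam optimizer, the 0.5 threshold, and the output_fc layer name follow the text; the batch contents and feature/label counts are illustrative placeholders, and this is a hedged reconstruction rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class ToxicityModel(nn.Module):
    """Joint fusion of the image and tabular branches defined above."""

    def __init__(self, num_features: int, num_labels: int):
        super().__init__()
        self.image_branch = ImageBranch()                   # f_img: (B, 128)
        self.tabular_branch = TabularBranch(num_features)   # f_tab: (B, 128)
        self.output_fc = nn.Linear(256, num_labels)         # Eq. (3) params

    def forward(self, image: torch.Tensor, tabular: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([self.image_branch(image),
                           self.tabular_branch(tabular)], dim=1)  # (B, 256)
        return torch.sigmoid(self.output_fc(fused))  # independent probabilities

model = ToxicityModel(num_features=32, num_labels=6)   # placeholder sizes
criterion = nn.BCELoss()                               # multi-label BCE
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

image = torch.randn(16, 3, 224, 224)                   # dummy image batch
tabular = torch.randn(16, 32)                          # dummy property batch
labels = torch.randint(0, 2, (16, 6)).float()          # dummy multi-label targets

probs = model(image, tabular)                          # (B, C) in [0, 1]
loss = criterion(probs, labels)                        # averages per-label BCE
optimizer.zero_grad(); loss.backward(); optimizer.step()
predicted = (probs >= 0.5).int()                       # threshold rule above
```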

The proposed method has several advantages. First, the joint fusion strategy allows for the effective integration of diverse data modalities, enabling the model to leverage complementary information from both image and tabular data sources. This enhances the model’s ability to capture complex relationships between a chemical compound’s structure and its toxicity profile.

Second, the use of deep learning models such as the Vision Transformer (ViT) and MLP ensures that the model can handle large-scale datasets with high-dimensional inputs, making it suitable for real-world chemical toxicity prediction tasks. Moreover, the model's outputs (probabilities for each toxicity label) can be thresholded to provide interpretable binary predictions, offering valuable insights for decision-making.

In conclusion, the proposed multi-modal deep learning model represents a robust framework for predicting the toxicity of chemical compounds. By effectively integrating image and tabular data, the model provides a more accurate and efficient prediction compared to traditional methods that rely on single-modal inputs.

Experimental setup and results

Experimental setup

The experimental environment was designed to facilitate the effective integration of chemical property data and molecular structure image data for predicting multi-label toxicity across multiple categories, including general symptoms, inhalation, skin, eye, oral, and others. The entire pipeline was implemented using Python 3.10.12 (ref. 24) on a Linux system equipped with an NVIDIA A100-SXM4-40GB GPU, which supports CUDA version 12.1. The deep learning frameworks PyTorch 2.1.0 and TensorFlow 2.17.1 (ref. 25) were utilized for model development, training, and evaluation. GPU availability was verified through the frameworks, ensuring optimized computational performance during experimentation.

The dataset was constructed by fusing molecular structure image data with chemical property and toxicity datasets. To construct the molecular structure image data, 4179 molecular structure images were collected using a Python-based web crawler as shown in Fig. 2. This crawler scraped publicly available chemical information websites, including PubChem and eChemPortal, to download molecular structure images based on CAS (Chemical Abstracts Service)34 numbers. Each image was annotated with its corresponding CAS number and stored with its file path for seamless integration with chemical property data. Additionally, no image rotations or flipping were applied in the preprocessing, while zero normalization was performed.

Fig. 2. An example of molecular structure images.
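The crawler itself is not published, so the following is only a hedged sketch of how such image collection could work. It queries PubChem's PUG REST interface, which resolves CAS registry numbers through its name namespace and returns a 2D depiction; the endpoint is real, but its use here as a stand-in for the authors' crawler is an assumption.

```python
import requests

def fetch_structure_image(cas_number: str, out_path: str) -> None:
    """Download a 2D structure depiction for a CAS number from PubChem."""
    url = (f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/"
           f"compound/name/{cas_number}/PNG")
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    with open(out_path, "wb") as f:
        f.write(response.content)

# Example: formaldehyde (CAS 50-00-0); the file name doubles as the annotation
# key linking the image back to the chemical property records.
fetch_structure_image("50-00-0", "50-00-0.png")
```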

Chemical property and toxicity data were curated from eight publicly available sources26–33, including the chemical safety information provided by Korea's Ministry of Environment, as shown in Tables 1 and 2. Relevant columns were extracted and merged into a single dataset using the merge() function from the pandas library, with CAS numbers serving as the primary key. Toxicity data, comprising textual descriptions of symptoms for various exposure types (e.g., general symptoms, inhalation, skin, eye, oral, others), were processed using natural language processing (NLP)35 techniques. Tokenization, stopword removal, and stemming were applied to convert the textual data into a structured keyword list suitable for machine learning tasks. For instance, textual descriptions such as "causes permanent damage to the digestive tract and may result in nausea, vomiting, and diarrhea" were transformed into keywords such as "digestive tract damage, nausea, vomiting, diarrhea."
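As an illustration, here is a minimal sketch of that text pipeline. The paper does not name its NLP toolkit, so NLTK's stopword list and Porter stemmer are assumptions, and a simple regular expression stands in for the tokenizer.

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords", quiet=True)
STOP_WORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def extract_symptom_keywords(text: str) -> list:
    tokens = re.findall(r"[a-z]+", text.lower())         # tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stopword removal
    return [STEMMER.stem(t) for t in tokens]             # stemming

keywords = extract_symptom_keywords(
    "causes permanent damage to the digestive tract and may "
    "result in nausea, vomiting, and diarrhea")
# Roughly: ['caus', 'perman', 'damag', 'digest', 'tract', 'may', 'result',
#           'nausea', 'vomit', 'diarrhea']
```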

Table 1.

Chemical property and toxicity data curated from eight publicly available sources from Korea's Ministry of Environment.

| File name | Columns | Data types | Total rows |
| --- | --- | --- | --- |
| toxicity.csv26 | Serial Number, Substance Name (English), Substance Name (Korean), CAS Number, General Symptoms (Inhalation, Skin, Eye, Oral, Others) | Object | 7189 |
| chemistry.csv27 | CAS_NO, Chemical Name (Korean/English), DTXSID_CN, Structural Formula File Name, Source, Year | Object | 12818 |
| Chemicals_Density.csv28 | cas_no, Density, Unit, Condition, Source, Source Description, Korean Name, English Name | Float | 6701 |
| Chemicals_Molecular_Weight.csv29 | cas_no, Molecular Weight, Condition, Source, Source Description, Korean Name, English Name | Float | 15147 |
| Chemicals_Boiling_Point.csv30 | sn_no, cas_no, Boiling Point, Condition, Source, Source Description, Korean Name, English Name | Categorical | 16726 |
| Chemicals_Material_Properties.csv31 | cas_no, Color, Status, Condition, Source, Source Description, Korean Name, English Name | Categorical | 11667 |
| Chemicals_Melting_Point.csv32 | cas_no, Melting Point, Condition, Source, Source Description, Korean Name, English Name | Float | 16981 |
| Chemicals_Water.csv33 | cas_no, Water Solubility, Unit, Condition, Source | Float | 22779 |

Table 2.

The merged dataset constructed from the sources in Table 1.

| Column name | Description | Data types | Count |
| --- | --- | --- | --- |
| CAS_NO | CAS Number | Alphanumeric | 4179 |
| dens | Density | Float | 2679 |
| mole_cn | Molecular Weight | Float | 4003 |
| boilpt_cn | Boiling Point | Float + Range | 3070 |
| color_cn | Color | Categorical | 4178 |
| sttus_cn | State | Categorical | 4179 |
| meltpt_cn | Melting Point (Freezing Point) | Float + Range | 3159 |
| water_slbl_cn | Water Solubility | Float | 368 |

The merged dataset comprised 4179 chemical records with molecular structure images and detailed property data, as shown in Table 2. Missing values in numerical features such as boiling points and melting points were imputed using correlated features, and range values were replaced with their midpoint. Categorical variables such as color and state were encoded using one-hot encoding, ensuring that all features were normalized for training. Toxicity data were converted into multi-label binary vectors using MultiLabelBinarizer, enabling the representation of symptoms as binary vectors (e.g., [0, 1, 1, 0, ...]). The final constructed dataset is shown in Table 3.

Table 3.

The final constructed dataset.

| Dataset | Data format | Count |
| --- | --- | --- |
| Chemical Properties Data | CSV (text) | 4179 |
| 2D Structure Images per Chemical | JPG (Image) | 4179 |
| Total | | 8358 |
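A minimal sketch of the encoding steps described before Table 3 follows; the example symptom lists, feature values, and the range-parsing helper are illustrative only, not taken from the dataset.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Multi-label toxicity targets -> binary vectors.
symptoms = [["irritation", "vomiting"], ["dizziness"], ["irritation", "burns"]]
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(symptoms)
print(mlb.classes_)   # ['burns' 'dizziness' 'irritation' 'vomiting']
print(Y[0])           # [0 0 1 1]

# Range values (e.g. a boiling point recorded as "80-85") -> midpoint.
def range_midpoint(value: str) -> float:
    lo, hi = (float(v) for v in str(value).split("-"))
    return (lo + hi) / 2.0

print(range_midpoint("80-85"))  # 82.5

# Categorical columns (color, state) -> one-hot encoding.
df = pd.DataFrame({"sttus_cn": ["liquid", "solid", "liquid"]})
X_cat = pd.get_dummies(df, columns=["sttus_cn"])
```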

To ensure robust evaluation, the dataset was split into training, validation, and testing sets in a 7:1:2 ratio using the train_test_split function from the scikit-learn library36, with the random seed set to 0. During model training, a batch size of 16 was utilized to handle data efficiently. The optimization algorithm employed in our study is Adam, with a learning rate set to 0.0001 (ref. 37). The dropout rate was configured at 0.3 to mitigate overfitting. The pre-trained model utilized was the Vision Transformer (ViT) trained on the ImageNet-21k dataset, specifically the variant with a patch size of 16. The multi-label binary cross-entropy loss (BCELoss) function was chosen to handle the multi-label nature of the predictions. Training was conducted over 30 epochs, with early stopping applied to prevent overfitting: training was halted if the validation accuracy did not improve for three consecutive epochs. Model performance was evaluated using metrics such as accuracy, F1-score38, and the Pearson correlation coefficient (PCC)39.
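Since train_test_split performs a single split, the 7:1:2 ratio presumably requires chaining two calls; the sketch below shows that assumption together with an early-stopping skeleton. The train_one_epoch and validation_accuracy helpers are hypothetical placeholders for the training and evaluation loops, which the paper does not publish.

```python
from sklearn.model_selection import train_test_split

records = list(range(4179))  # stand-in for the 4179 merged samples
train_set, holdout = train_test_split(records, test_size=0.3, random_state=0)
val_set, test_set = train_test_split(holdout, test_size=2/3, random_state=0)
print(len(train_set), len(val_set), len(test_set))  # roughly 70% / 10% / 20%

# Early stopping on validation accuracy with patience 3, over at most 30 epochs.
best_acc, patience, bad_epochs = 0.0, 3, 0
for epoch in range(30):
    train_one_epoch(model, train_set, batch_size=16)   # hypothetical helper
    acc = validation_accuracy(model, val_set)          # hypothetical helper
    if acc > best_acc:
        best_acc, bad_epochs = acc, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break
```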

Experimental results

In this section, we analyze the BEiT and DeiT models to compare their performance with the ViT model used in the proposed method. The Bidirectional Encoder for Image Transformers (BEiT) applies the principles of BERT, originally developed for natural language processing (NLP), to image data. It leverages self-supervised learning and bidirectional context understanding, enabling the model to capture complex image representations. However, BEiT requires intricate tokenization processes, which can add complexity to the model. The Data-efficient Image Transformer (DeiT) improves the efficiency of ViT by using knowledge distillation, allowing the model to learn effectively from a smaller amount of data. It can achieve performance comparable to ViT with only about one-tenth of the training data, though its performance is heavily dependent on the quality of the teacher model. Additionally, DeiT has limitations when handling more complex tasks.

Figure 3 illustrates the loss values of the three models (ViT40, BEiT41, and DeiT42) over 22 epochs. ViT exhibits the most efficient loss minimization, rapidly decreasing in the initial epochs and converging to almost zero by the 20th epoch. This demonstrates its robust learning capability and effective optimization during training. DeiT follows a similar trend, with a steady decline in loss, although it converges slightly slower than ViT. In contrast, BEiT shows a slower reduction in loss and stabilizes at a higher value, indicating its relative inefficiency in adapting to the dataset. This behavior may be attributed to the model's reliance on pretraining strategies that are less suited to the domain-specific characteristics of this dataset.

Fig. 3. The loss values of ViT, DeiT, and BEiT models over 20 epochs, highlighting the optimization progress of each model.

Figure 4 depicts the accuracy of the models for each epoch on the validation dataset. ViT achieves the highest accuracy, reaching approximately 95.39% at the 21st epoch. This result highlights its superior capacity to generalize and correctly classify samples. DeiT also demonstrates commendable performance, attaining an accuracy of 79.57%, which reflects the effectiveness of its teacher-student distillation approach. However, BEiT underperforms significantly, with its accuracy plateauing at 19.67%. This limitation suggests that BEiT's self-supervised pretraining and tokenization mechanisms might not be optimally aligned with the dataset's characteristics.

Fig. 4. The accuracy trends of ViT, DeiT, and BEiT models over 20 epochs, demonstrating their performance improvement during training on the validation dataset.

Figure 5 shows the F1 score, a metric that balances precision and recall, on the validation dataset. Here, ViT once again leads, achieving a final F1 score close to 0.8845, signifying its strength in handling imbalanced data distributions. DeiT performs competitively, reaching an F1 score of 0.8706, which underscores its reliability in maintaining a balance between precision and recall. Conversely, BEiT shows slower growth in its F1 score, culminating at a much lower value of 0.6801. This underperformance indicates challenges in adapting its masked patch prediction mechanism to this particular task.

Fig. 5. F1 score progression of ViT, DeiT, and BEiT models across 20 epochs, reflecting their balance between precision and recall on the validation dataset.

Figure 6 illustrates the Pearson Correlation Coefficient (PCC), which measures the linear correlation between predicted and actual values, on the validation dataset. ViT stands out with the highest PCC value of 0.8971, reflecting its ability to accurately capture the underlying data patterns. DeiT follows with a PCC value of 0.5770, showing moderate effectiveness in preserving predictive accuracy. BEiT, however, achieves a lower PCC value of 0.2178, highlighting its limitations in modeling the linear relationships between inputs and outputs. This could be due to suboptimal pretraining or insufficient alignment with the dataset's structure.

Fig. 6. Pearson Correlation Coefficient (PCC) of ViT, DeiT, and BEiT models over 20 epochs, indicating their alignment with the ground truth on the validation dataset.

The evaluation metrics reported in Figs. 4, 5, and 6 were computed as follows: accuracy and F1-score were calculated for each toxic endpoint individually and then averaged across all endpoints to provide a single representative value for each method. This approach ensures a comprehensive evaluation of model performance across the validation dataset. For the Pearson Correlation Coefficient (PCC), the computation was performed by comparing the predicted toxicity values with the ground truth values across all samples and endpoints. The reported PCC values represent the overall correlation between predicted and actual toxicity levels, providing a measure of the model’s predictive consistency.
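A hedged sketch of this metric computation follows, using scikit-learn and SciPy; here y_true and y_pred are (N, C) binary arrays over N samples and C endpoints, and y_prob holds the predicted probabilities.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import accuracy_score, f1_score

def evaluate_endpoints(y_true, y_pred, y_prob):
    num_endpoints = y_true.shape[1]
    # Accuracy and F1 per toxic endpoint, then averaged across endpoints.
    acc = np.mean([accuracy_score(y_true[:, c], y_pred[:, c])
                   for c in range(num_endpoints)])
    f1 = np.mean([f1_score(y_true[:, c], y_pred[:, c], zero_division=0)
                  for c in range(num_endpoints)])
    # PCC over all samples and endpoints jointly.
    pcc, _ = pearsonr(y_true.ravel().astype(float), y_prob.ravel())
    return acc, f1, pcc
```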

Table 4 presents a detailed performance comparison of the three models (ViT, BEiT, and DeiT) evaluated across three key metrics on the test dataset. The accuracy, F1 score, and Pearson Correlation Coefficient (PCC) indicate the performance of the proposed method on the test dataset. Regarding accuracy, ViT achieves an impressive 87.2%, clearly outperforming both DeiT (86.84%) and BEiT (65.77%). This demonstrates the robustness of ViT in correctly classifying data. The F1 score, which provides a balance between precision and recall, is also highest for ViT at 0.860; BEiT follows at 0.8408, while DeiT scores the lowest at 0.8007. Finally, the Pearson Correlation Coefficient (PCC), which measures the linear correlation between predictions and actual values, is highest for ViT at 0.9192. DeiT achieves a moderate PCC of 0.8311, while BEiT records the lowest value of 0.5660, indicating its limited ability to align predictions with ground truth.

Table 4.

Performance comparison of the multi-modal models on the test dataset.

| Model | Accuracy | F1 score | PCC |
| --- | --- | --- | --- |
| ViT | 0.872 | 0.860 | 0.9192 |
| BEiT | 0.6577 | 0.8408 | 0.5660 |
| DeiT | 0.8684 | 0.8007 | 0.8311 |

Overall, the results demonstrate that ViT outperforms both BEiT and DeiT across all metrics, excelling in efficiency, accuracy, and correlation with true values. DeiT shows reasonable performance, particularly in accuracy, but remains behind ViT in the other metrics. BEiT, despite leveraging unique pretraining strategies, underperforms significantly, especially in accuracy and PCC, likely due to a mismatch between its pretraining strategy and the domain-specific characteristics of the data.

To evaluate the effectiveness of the proposed multimodal approach, a comparative analysis was conducted against two widely used machine learning models, LightGBM43 and XGBoost44. LightGBM is a gradient boosting framework that uses histogram-based learning to enhance training efficiency and reduce memory consumption, making it well-suited for large-scale datasets. XGBoost, another gradient boosting framework, employs a sparsity-aware algorithm and weighted quantile sketching to improve computational speed and model performance, particularly in handling missing values and imbalanced data. The proposed method integrates 2D chemical structure images and tabular (CSV) data using a Vision Transformer (ViT)-based architecture, whereas LightGBM and XGBoost rely solely on tabular data. These models were selected due to their extensive use in chemical property prediction tasks based on structured numerical features. For a fair comparison, hyperparameters for LightGBM and XGBoost were optimized through grid search. The performance of each model on the test dataset is summarized in Table 5.

Table 5.

Performance comparison of the proposed method (ViT), LightGBM, and XGBoost on the test dataset.

| Model | Accuracy | F1 score | PCC |
| --- | --- | --- | --- |
| Proposed method (ViT) | 0.872 | 0.860 | 0.9192 |
| LightGBM | 0.661 | 0.735 | 0.632 |
| XGBoost | 0.676 | 0.732 | 0.662 |
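For reference, an illustrative sketch of the tabular-only baselines follows. Wrapping the boosters in MultiOutputClassifier for multi-label output is an assumption, the hyperparameters stand in for the grid-searched values (which are not reported), and the random data is a placeholder for the property features.

```python
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.multioutput import MultiOutputClassifier
from xgboost import XGBClassifier

# Placeholder tabular features (no images) and multi-label targets.
X_train, Y_train = np.random.rand(100, 32), np.random.randint(0, 2, (100, 6))
X_test = np.random.rand(20, 32)

lgbm = MultiOutputClassifier(LGBMClassifier(n_estimators=200))
xgb = MultiOutputClassifier(XGBClassifier(n_estimators=200,
                                          eval_metric="logloss"))
lgbm.fit(X_train, Y_train)
Y_pred = lgbm.predict(X_test)   # (N, C) binary predictions per label
```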

As shown in Table 5, the proposed multimodal ViT-based approach demonstrates superior performance compared to LightGBM and XGBoost across all evaluation metrics. The multimodal model achieves an accuracy of 0.872, significantly exceeding that of LightGBM (0.661) and XGBoost (0.676). Additionally, the F1 Score of 0.860 indicates a well-balanced classification performance, while the Pearson Correlation Coefficient (PCC) of 0.9192 reflects a strong correlation between predictions and ground truth, highlighting the model’s robustness.

The observed performance improvements can be attributed to the multimodal nature of the proposed approach, which effectively integrates structural and numerical information to enhance predictive accuracy. Unlike LightGBM and XGBoost, which rely solely on tabular data, the ViT-based model leverages chemical structure images, enabling it to capture complex spatial and relational patterns that are otherwise difficult to model using traditional feature-based methods.

Discussion

Contribution

To address the limitations of traditional machine learning techniques in predicting specific toxicity endpoints, this study integrates both numerical and image-based chemical data. The proposed deep learning framework combines the Vision Transformer (ViT) for molecular structure images and a Multilayer Perceptron (MLP) for numerical properties. This fusion-based approach enhances feature representation and predictive accuracy, mitigating the constraints posed by insufficient toxicity data. Furthermore, by employing a multi-label learning strategy, the model effectively generalizes across diverse toxicological endpoints, which is particularly beneficial given the fragmented nature of available datasets. While ensemble models are often employed to enhance generalization, they may not entirely resolve the challenges associated with data scarcity.

Averaging methods in multi-label classification

In our research, we opted for the samples averaging method because it aligns with our focus on evaluating the toxicity label prediction performance of each chemical compound sample individually. Specifically, we aimed to assess how accurately the model predicted multiple toxicity labels for each sample. Our experimental results indicated that the samples averaging method yielded superior performance compared to other averaging methods. This suggests that our model effectively predicts the toxicity labels for each chemical compound sample individually.
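This "samples" averaging is exposed directly by scikit-learn: the F1 score is computed per chemical sample over its label vector, then averaged across samples, as in the snippet below (with toy arrays).

```python
import numpy as np
from sklearn.metrics import f1_score

Y_true = np.array([[1, 0, 1], [0, 1, 0]])
Y_pred = np.array([[1, 0, 0], [0, 1, 0]])
print(f1_score(Y_true, Y_pred, average="samples"))  # mean of per-sample F1
```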

Dataset

The datasets used in this work were sourced from the Chemical Safety Information provided by Korea’s Ministry of Environment, which focuses on chemicals of emerging regulatory concern within the region. These substances have not been systematically studied in prior research, particularly in the context of deep learning for toxicity prediction. By prioritizing these datasets, our goal was to contribute novel insights into localized chemical risks, which global databases may not fully represent due to differences in regulatory frameworks or regional exposure patterns.

Importantly, the toxicity data was structured into six predefined symptom categories (General, Inhalation, Dermal, Ocular, Oral, Other) to align with Korea’s standardized chemical safety reporting guidelines. This intentional limitation allowed us to train models on symptom-specific toxicity profiles that reflect real-world regulatory needs, rather than aggregating heterogeneous data from diverse sources. While this approach narrows the scope, it ensures practical relevance for regional risk assessment and management.

Regarding molecular images, we complemented the Korea-specific toxicity labels with structural data crawled from PubChem and eChemPortal. This hybrid methodology ensured the accuracy of molecular representations while maintaining our focus on locally relevant toxicity profiles. We believe this integration of regionally curated toxicity data with globally validated structural information represents a unique strength of our dataset, balancing specificity and scientific rigor.

While it is true that larger datasets generally facilitate more robust model training, we employed several strategies to mitigate the limitations posed by the dataset’s size and achieve satisfactory performance. Firstly, we leveraged transfer learning by utilizing a pre-trained Vision Transformer (ViT) model. This approach allowed us to benefit from features learned on a large-scale image dataset, effectively compensating for the limited size of our own dataset. The pre-trained ViT model provided a strong foundation for feature extraction, which was then fine-tuned on our specific task. Secondly, we employed regularization methods, including dropout and weight decay, to prevent overfitting. These techniques helped to constrain the model’s complexity and improve its generalization ability. Thirdly, we utilized early stopping during the training process. By monitoring the model’s performance on a validation set, we were able to halt training when performance began to plateau or decline, preventing the model from overfitting to the training data. Fourthly, given the multi-label nature of our toxicity prediction task, we implemented multi-label learning strategies. This allowed us to train the model to predict multiple toxicity labels simultaneously, capturing the interdependencies between different toxicity categories. Lastly, we integrated numerical and categorical features alongside the image data. This multimodal approach provided the model with additional contextual information, further enhancing its predictive capabilities. Despite the dataset’s size, our approach effectively addressed the challenges of deep learning on limited data. We recognize that increasing the dataset size would likely lead to further improvements in model performance.

We applied zero normalization to the entire dataset. To ensure a more appropriate evaluation for imbalanced data, we additionally assessed model performance using balanced accuracy. The results demonstrated minimal variation (0.001–0.002) compared to accuracy, indicating a negligible impact on overall model evaluation. Furthermore, the dataset consists exclusively of toxic compounds, with no inclusion of non-toxic compounds. Consequently, classification between toxic and non-toxic compounds was not applicable in this study. A comprehensive list of toxic compounds has been provided in Table 8 of the “Appendix” for reference.

Sensitivity of multi-modal models to fusion order

To investigate the impact of fusion order when integrating molecular structure images and properties into 1D vectors, additional experiments were conducted by altering the sequence from “Image + Property” to “Property + Image.”

Table 6 presents the performance comparison of multi-modal models on the test dataset. The results indicate that BEiT exhibits an improved F1 score (0.8741) compared to its previous configuration, suggesting that it benefits from the modified fusion order, potentially due to its robust representation learning mechanism. In contrast, ViT and DeiT maintain relatively stable performance across different fusion orders, with only minor variations in accuracy, F1 score, and Pearson Correlation Coefficient (PCC). These findings highlight that the influence of fusion order on model performance varies depending on the architecture. BEiT demonstrates greater sensitivity to the fusion sequence, while ViT and DeiT appear to be less affected. This observation provides valuable insights for optimizing fusion strategies in multi-modal molecular property prediction tasks.

Table 6.

Performance comparison of multi-modal models with different fusion orders on the test dataset.

| Model | Accuracy | F1 score | PCC |
| --- | --- | --- | --- |
| ViT | 0.871 | 0.866 | 0.9171 |
| BEiT | 0.866 | 0.8741 | 0.766 |
| DeiT | 0.866 | 0.8239 | 0.8251 |

Predefined rule for using molecular structure images

All images were first converted to the PNG format to preserve visual fidelity and maintain format uniformity. They were then resized to a fixed resolution of 224 × 224 pixels using bilinear interpolation, a choice driven by the input requirements of the Vision Transformer (ViT) architecture adopted in this work. To align with the preprocessing conventions of pre-trained vision models, images were converted to RGB color mode, and pixel values were normalized using the mean ([0.485, 0.456, 0.406]) and standard deviation ([0.229, 0.224, 0.225]) parameters widely employed in the computer vision community.
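This depiction rule maps directly onto a torchvision pipeline, sketched below: conversion to RGB, bilinear resizing to 224 × 224, and normalization with the stated ImageNet statistics. The file path is illustrative.

```python
from PIL import Image
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Lambda(lambda img: img.convert("RGB")),
    transforms.Resize((224, 224),
                      interpolation=transforms.InterpolationMode.BILINEAR),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

tensor = preprocess(Image.open("molecule.png"))  # (3, 224, 224) model input
```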

Robustness to image transformations in molecular representations

The Vision Transformer (ViT) processes images by dividing them into small patches and converting them into 1D vectors. Since this tokenization approach can be sensitive to transformations, ensuring consistency in molecular image orientations is crucial for stable representation learning.

To address this, we adopted standardized molecular depiction rules that generate 2D molecular structure images with fixed orientations. This standardization minimizes variations caused by arbitrary rotations and maintains structural consistency across samples. Additionally, the multimodal nature of the proposed method integrates molecular structure images with textual and numerical chemical features, enhancing overall robustness. By leveraging information from multiple modalities, the model reduces the impact of perturbations in any single input representation.

Impact of image rotation on model performance

Table 7 presents a comparison between the proposed method with and without image rotation applied. Interestingly, our experiments revealed that incorporating image rotation led to a decline in performance. While image augmentation techniques generally enhance performance in standard image datasets, we hypothesize that in the case of chemical molecular structure images, rotation alters the perception of molecular bonds, leading to misinterpretations and performance degradation.

Table 7.

Performance comparison of the proposed method with and without image rotation on the test dataset.

| Model | Accuracy | F1 score | PCC |
| --- | --- | --- | --- |
| Proposed method (ViT) without rotation | 0.872 | 0.860 | 0.9192 |
| Proposed method (ViT) with rotation | 0.3031 | 0.8456 | 0.6299 |

Application domain

The accurate prediction of chemical toxicity is a critical challenge in interdisciplinary research, particularly in contexts requiring robust decision-making under constraints such as limited data availability and structural complexity. While the proposed multimodal deep learning framework demonstrates promising performance in multi-label toxicity prediction, it is essential to address the applicability domain (AD) of the model to ensure its reliability and interpretability in real-world scenarios.

The concept of an applicability domain, as emphasized in OECD guidelines45 and related literature46,47, defines the boundaries within which a predictive model operates with validated confidence. These boundaries are determined by the chemical space covered by the training data, the representativeness of molecular features, and the mechanistic relevance of the model to the endpoints being predicted. In this study, the integrated dataset-curated from diverse sources and normalized for structural, numerical, and toxicological consistency-aims to expand the chemical space coverage compared to domain-specific datasets (e.g., genotoxicity-only or acute toxicity-focused studies). However, the structural diversity of chemicals and the inherent biases in data sources (e.g., security-related restrictions) necessitate a rigorous definition of the AD to avoid extrapolation beyond the model’s validated scope.

To address this, the proposed framework incorporates two key AD-related considerations. First, the feature fusion mechanism combining Vision Transformer (ViT)-extracted molecular image embeddings and MLP-processed numerical descriptors inherently encodes chemical similarity metrics. This dual-stream architecture ensures that predictions are anchored in both structural and physicochemical property spaces, aligning with methodologies48 for AD definition. Second, the model’s training protocol emphasizes chemical diversity through stratified sampling and data augmentation, reducing overfitting to narrow subspaces.

Furthermore, the multi-label prediction capability necessitates a toxicity endpoint-specific AD analysis. For instance, predictions for acute oral toxicity may rely more heavily on specific physicochemical descriptors (e.g., logP, molecular weight), while structural alerts identified via ViT-based image analysis could dominate predictions for genotoxicity. This aligns with the ICCVAM49 recommendation for defining context-specific validity boundaries. Future work will involve implementing quantitative AD metrics, such as leverage-based approaches or distance-to-model measures, to provide explicit confidence intervals for individual predictions. In conclusion, while the proposed model achieves high accuracy (0.872) and strong correlation (PCC: 0.9192) in toxicity prediction, its real-world utility depends on transparently communicating the AD. Adherence to OECD and ICCVAM guidelines ensures that stakeholders can evaluate the model’s suitability for specific chemicals or regulatory purposes, ultimately enhancing trust in AI-driven toxicological assessments.

Limitation and future research

One major limitation of the proposed multimodal model is its applicability domain. Since the model integrates both image and structured data for classification, its prediction accuracy may be restricted when input data deviates significantly from the training distribution. This issue is inherent in deep learning models, as they rely on learned representations from a finite dataset and may struggle with out-of-distribution (OOD) samples.

In our study, the dataset used for training the classification model only includes labeled toxicity data, meaning that the model is constrained to making predictions within the provided dataset range. When a compound lacks a recorded toxicity endpoint, it is not explicitly assigned a 0 label; rather, the classification model predicts the most probable toxicity class based on learned patterns. This approach aligns with the fundamental nature of deep learning models, which do not inherently handle missing labels but instead infer the most probable category given the available information. While this method enables the model to generalize within the training distribution, it also highlights the need for strategies to handle missing or uncertain labels effectively.

To address this limitation, future research will focus on defining the applicability domain of the model and implementing mechanisms to detect and handle OOD inputs. Possible approaches include uncertainty estimation techniques such as Monte Carlo dropout and Bayesian neural networks, as well as distance-based methods like Mahalanobis distance or outlier detection algorithms. By incorporating these strategies, we aim to enhance the model’s robustness and reliability when dealing with unseen or atypical data.
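As a pointer to that future direction, here is a hedged sketch of Monte Carlo dropout, one of the uncertainty-estimation techniques mentioned above; `model` is assumed to be a dropout-containing network like the one proposed here, and the sample count is arbitrary.

```python
import torch

@torch.no_grad()
def mc_dropout_predict(model, image, tabular, n_samples: int = 30):
    """Predictive mean and per-label uncertainty via repeated stochastic passes."""
    model.train()  # keep dropout layers active at inference time
    probs = torch.stack([model(image, tabular) for _ in range(n_samples)])
    model.eval()
    return probs.mean(dim=0), probs.std(dim=0)  # high std -> low confidence
```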

Additionally, expanding the dataset to include a more diverse range of chemical substances will help improve the generalizability of the model. We also plan to explore domain adaptation techniques to mitigate performance degradation when applied to new chemical categories. Despite these limitations, our study presents a novel contribution by introducing a multimodal classification approach in the field of chemical toxicity prediction. We believe that further research in applicability domain estimation will complement our proposed method and contribute to its practical deployment in real-world scenarios.

Conclusion

In this paper, we proposed a multi-modal deep learning framework for chemical toxicity prediction, combining molecular structure images with chemical property data to improve predictive accuracy. The experimental results validated the superiority of our approach, particularly the Vision Transformer (ViT) model. Among the three models tested (ViT, BEiT, and DeiT), ViT consistently outperformed the others across all key metrics. In terms of accuracy, ViT achieved an impressive 87.2%, surpassing DeiT (86.84%) and BEiT (65.77%). The model also excelled in F1 score (0.86) and Pearson Correlation Coefficient (PCC) (0.9192), further confirming its superior performance. These findings highlight the ability of ViT to effectively handle multi-modal data, combining both image-based features and numerical data for more accurate chemical toxicity predictions. The ability to predict multiple toxicity endpoints, including general toxicity, dermal toxicity, ocular toxicity, and oral toxicity, further strengthens the model's applicability in various real-world contexts, such as environmental safety and public health.

Looking forward, this research opens several avenues for future studies. One potential direction is the exploration of additional data modalities, such as toxicological data from clinical studies or environmental exposure data, which could further enhance the model’s performance. Incorporating these data sources might allow for more nuanced predictions and a deeper understanding of the relationships between chemical structures and their toxicological impacts. Another area for improvement lies in expanding the model’s generalizability. While our framework demonstrated exceptional results on the dataset used, its applicability to other chemical domains or novel substances requires further testing. Future work may involve fine-tuning the model through transfer learning techniques, allowing it to adapt to new datasets with limited data. Furthermore, integrating real-time sensor data from wearable devices for continuous toxicity monitoring could pave the way for dynamic and on-site toxicity assessments in both civilian and military applications. Overall, while our study demonstrates a significant step forward in toxicity prediction, there remains much room for growth in refining and scaling the framework to address broader challenges in toxicological research and safety management.

Appendix

In Table 8, each of the 4179 data entries contains information on six symptom-related columns: General Symptoms, Inhalation, Dermal, Ocular, Oral, and Other. Each symptom column may include multiple symptoms per data entry.

To clarify, the positive count refers to the number of times a particular symptom appears within its respective category, while the negative count is obtained by subtracting the positive occurrences from the total number of data points (4179).

For further illustration, consider a single data entry (Sample A). This sample may exhibit the following symptoms across different categories:

  • General Symptoms: Irritation, Headache, Vomiting

  • Inhalation: Death, Coma, Edema, Dizziness

  • Dermal: Blisters, Edema

  • Ocular: Irritation, Congestion, Corneal Damage, Tearing

  • Oral: Vomiting

  • Other: Coma

Table 8 provides a summary of the toxicity endpoints across these six symptom categories, detailing the number of unique symptom labels and the top 10 most frequently occurring symptoms within each category. For instance, in the General Symptoms category, there are 50 unique symptom labels, with the most frequently occurring symptoms including Irritation (844 occurrences), Burns (381 occurrences), Dizziness (212 occurrences), etc.

Funding

This study was supported by Future Strategy and Technology Research Institute of the Korea Military Academy (2025-NWMD-05) and the National Research Foundation of Korea (NRF) grant, funded by the Korean government (MSIT) (RS-2025-00516065).

Data availability

The data used to support the findings of this study will be available from the corresponding author upon request after acceptance.

Declarations

Competing interests

The authors declare that there are no conflicts of interest regarding the publication of this article.

Ethical approval

All authors give ethical and informed consent.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1. Pérez Santín, E. et al. Toxicity prediction based on artificial intelligence: A multidisciplinary overview. Wiley Interdisc. Rev. Comput. Mole. Sci. 11(5), 1516 (2021).
  • 2. Tran, T. T. V., Surya Wibowo, A., Tayara, H. & Chong, K. T. Artificial intelligence in drug toxicity prediction: Recent advances, challenges, and future perspectives. J. Chem. Inf. Model. 63(9), 2628–2643 (2023).
  • 3. Cavasotto, C. N. & Scardino, V. Machine learning toxicity prediction: Latest advances by toxicity end point. ACS Omega 7(51), 47536–47546 (2022).
  • 4. Wu, Z. et al. Mining toxicity information from large amounts of toxicity data. J. Med. Chem. 64(10), 6924–6936 (2021).
  • 5. Sharma, N., Naorem, L. D., Jain, S. & Raghava, G. P. ToxinPred2: An improved method for predicting toxicity of proteins. Brief. Bioinform. 23(5), bbac174 (2022).
  • 6. Berg, F. & Kappler, S. Future biological and chemical weapons. In Ciottone's Disaster Medicine, pp. 520–530 (Elsevier, 2024).
  • 7. Gustavsson, M. et al. Transformers enable accurate prediction of acute and chronic chemical toxicity in aquatic organisms. Sci. Adv. 10(10), 6669 (2024).
  • 8. Stucki, A. O. et al. Use of new approach methodologies (NAMs) to meet regulatory requirements for the assessment of industrial chemicals and pesticides for effects on human health. Front. Toxicol. 4, 964553 (2022).
  • 9. Syafrudin, M. et al. Pesticides in drinking water: A review. Int. J. Environ. Res. Public Health 18(2), 468 (2021).
  • 10. Lin, Z. & Chou, W.-C. Machine learning and artificial intelligence in toxicological sciences. Toxicol. Sci. 189(1), 7–19 (2022).
  • 11. Badwan, B. A., Liaropoulos, G., Kyrodimos, E., Skaltsas, D., Tsirigos, A. & Gorgoulis, V. G. Machine learning approaches to predict drug efficacy and toxicity in oncology. Cell Rep. Methods 3(2) (2023).
  • 12. Huang, T. et al. Quantitative structure-activity relationship (QSAR) studies on the toxic effects of nitroaromatic compounds (NACs): A systematic review. Int. J. Mol. Sci. 22(16), 8557 (2021).
  • 13. Kurani, A., Doshi, P., Vakharia, A. & Shah, M. A comprehensive comparative study of artificial neural network (ANN) and support vector machines (SVM) on stock forecasting. Ann. Data Sci. 10(1), 183–208 (2023).
  • 14. Khajavi, H. & Rastgoo, A. Predicting the carbon dioxide emission caused by road transport using a random forest (RF) model combined by meta-heuristic algorithms. Sustain. Cities Soc. 93, 104503 (2023).
  • 15. Isnain, A. R., Supriyanto, J. & Kharisma, M. P. Implementation of k-nearest neighbor (k-NN) algorithm for public sentiment analysis of online learning. Indones. J. Comput. Cybern. Syst. 15(2), 121–130 (2021).
  • 16. Trinh, T. X., Seo, M., Yoon, T. H. & Kim, J. Developing random forest based QSAR models for predicting the mixture toxicity of TiO2 based nano-mixtures to Daphnia magna. NanoImpact 25, 100383 (2022).
  • 17. Sharma, B. et al. Accurate clinical toxicity prediction using multi-task deep neural nets and contrastive molecular explanations. Sci. Rep. 13(1), 4908 (2023).
  • 18. Sun, M. et al. Graph convolutional networks for computational drug development and discovery. Brief. Bioinform. 21(3), 919–935 (2020).
  • 19. Schwartz, J., Awale, M. & Reymond, J.-L. SMIfp (SMILES fingerprint) chemical space for virtual screening and visualization of large databases of organic molecules. J. Chem. Inf. Model. 53(8), 1979–1989 (2013).
  • 20. Hirohara, M., Saito, Y., Koda, Y., Sato, K. & Sakakibara, Y. Convolutional neural network based on SMILES representation of compounds for detecting chemical motif. BMC Bioinform. 19, 83–94 (2018).
  • 21. Schneider, M., Pons, J.-L., Labesse, G. & Bourguet, W. In silico predictions of endocrine disruptors properties. Endocrinology 160(11), 2709–2716 (2019).
  • 22. Liu, P., Ren, Y., Tao, J. & Ren, Z. GIT-Mol: A multi-modal large language model for molecular science with graph, image, and text. Comput. Biol. Med. 171, 108073 (2024).
  • 23. Dosovitskiy, A. et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020).
  • 24. Imambi, S., Prakash, K. B. & Kanagachidambaresan, G. PyTorch. In Programming with TensorFlow: Solution for Edge Computing Applications, pp. 87–104 (2021).
  • 25. Pang, B., Nijkamp, E. & Wu, Y. N. Deep learning with TensorFlow: A review. J. Educ. Behav. Stat. 45(2), 227–248 (2020).
  • 26. Parker, N. A. Rapid and Spatially Explicit Assessment of Contaminants of Emerging Concern in Data Limited Watersheds (University of California, Santa Barbara, 2023).
  • 27. Ma, H. et al. Unveiling the structure-surface energy relationship of zeolites through machine learning. J. Phys. Chem. C 128(36), 14927–14936 (2024).
  • 28. Guo, J., Woo, V., Andersson, D. A., Hoyt, N., Williamson, M., Foster, I., Benmore, C., Jackson, N. E. & Sivaraman, G. AL4GAP: Active learning workflow for generating DFT-SCAN accurate machine-learning potentials for combinatorial molten salt mixtures. J. Chem. Phys. 159(2) (2023).
  • 29. Rostamkhani, N. et al. Enhanced anti-tumor and anti-metastatic activity of quercetin using pH-sensitive alginate@ZIF-8 nanocomposites: In vitro and in vivo study. Nanotechnology 35(47), 475102 (2024).
  • 30. Galatro, D. & Dawe, S. Data-based modelling for prediction. In Data Analytics for Process Engineers: Prediction, Control and Optimization, pp. 59–105 (Springer, 2023).
  • 31. Jaafar, S. M. & Sukri, R. S. Data on the physicochemical characteristics and texture classification of soil in Bornean tropical heath forests affected by exotic Acacia mangium. Data Brief 51, 109670 (2023).
  • 32. Tetley, M. J. The Distribution, Ecological Niche Modelling and Habitat Suitability Mapping of the Minke Whale (Balaenoptera acutorostrata) within the North Atlantic (Bangor University, United Kingdom, 2010).
  • 33. Nitharshni, J., Nilasruthy, R., Shakthi Akshaiya, K. & Rajavel, M. Quality check of water for human consumption using machine learning. Adv. Sci. Technol. 124, 574–589 (2023).
  • 34. Baum, Z. J. et al. Artificial intelligence in chemistry: Current trends and future directions. J. Chem. Inf. Model. 61(7), 3197–3212 (2021).
  • 35. Kang, Y., Cai, Z., Tan, C.-W., Huang, Q. & Liu, H. Natural language processing (NLP) in management research: A literature review. J. Manage. Anal. 7(2), 139–172 (2020).
  • 36. Bisong, E. Introduction to scikit-learn. In Building Machine Learning and Deep Learning Models on Google Cloud Platform: A Comprehensive Guide for Beginners, pp. 215–229 (Springer, 2019).
  • 37. Guan, L. Weight prediction boosts the convergence of AdamW. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 329–340 (Springer, 2023).
  • 38. Yacouby, R. & Axman, D. Probabilistic extension of precision, recall, and F1 score for more thorough evaluation of classification models. In Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems, pp. 79–91 (2020).
  • 39. Benesty, J., Chen, J. & Huang, Y. On the importance of the Pearson correlation coefficient in noise reduction. IEEE Trans. Audio Speech Lang. Process. 16(4), 757–765 (2008).
  • 40. Yue, X., Sun, S., Kuang, Z., Wei, M., Torr, P. H., Zhang, W. & Lin, D. Vision transformer with progressive sampling. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 387–396 (2021).
  • 41. Bao, H., Dong, L., Piao, S. & Wei, F. BEiT: BERT pre-training of image transformers. arXiv preprint arXiv:2106.08254 (2021).
  • 42. Touvron, H., Cord, M. & Jégou, H. DeiT III: Revenge of the ViT. In European Conference on Computer Vision, pp. 516–533 (Springer, 2022).
  • 43. Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q. & Liu, T.-Y. LightGBM: A highly efficient gradient boosting decision tree. Adv. Neural Inf. Process. Syst. 30 (2017).
  • 44. Chen, T. & Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794 (2016).
  • 45. Organisation for Economic Co-operation and Development (OECD). Guidance document on the validation of (quantitative) structure-activity relationship [(Q)SAR] models (OECD, 2014).
  • 46. Hanser, T., Barber, C., Marchaland, J. & Werner, S. Applicability domain: Towards a more formal definition. SAR QSAR Environ. Res. 27(11), 865–881 (2016).
  • 47. Kar, S., Roy, K. & Leszczynski, J. Applicability domain: A step toward confident predictions and decidability for QSAR modeling. In Computational Toxicology: Methods and Protocols, pp. 141–169 (2018).
  • 48. Sahigara, F. et al. Comparison of different approaches to define the applicability domain of QSAR models. Molecules 17(5), 4791–4810 (2012).
  • 49. Interagency Coordinating Committee on the Validation of Alternative Methods (ICCVAM). Regulatory acceptance of new approach methodologies: A report of the ICCVAM Validation Workgroup (2023).
